Performing joins on very large data sets - CSV

I have received several CSV files that I need to merge into a single file, all with a common key I can use to join them. Unfortunately, each of these files is about 5 GB in size (several million rows, about 20-100+ columns), so it's not feasible to just load them into memory and execute a join against each one, but I do know that I don't have to worry about column-name conflicts between them.
I tried making an index of the row in each file that corresponded to each ID so I could compute the result without using much memory, but of course that's slow as time itself when it comes to actually looking up each row, pulling the rest of the CSV data from the row, concatenating it onto the in-progress data and then writing it out to a file. Even on an SSD, this simply isn't feasible against the millions of rows in each file.
I also tried simply loading some of the smaller sets into memory and running a Parallel.ForEach against them to match up the necessary data and dump it back out to a temporary merged file. While this was faster than the previous method, I simply don't have the memory to do it with the larger files.
I'd ideally like to just do a full left join of the largest of the files, then full left join to each subsequently smaller file so it all merges.
How might I otherwise go about approaching this problem? I've got 24 GB of memory and six cores on this system to work with.
While this might just be a problem to load into a relational database and do the join there, I thought I'd reach out before going that route to see if there are any ideas on solving this from my local system.
Thanks!

A relational database is the first thing that comes to mind and probably the easiest, but barring that...
Build a hash table mapping key to file offset. Parse the rows on-demand as you're joining. If your keyspace is still too large to fit in available address space, you can put that in a file too. This is exactly what a database index would do (though maybe with a b-tree).
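For illustration, a minimal Python sketch of that idea - assuming plain CSVs with a header row, no newlines embedded inside fields, and a known key column name, all of which you'd need to confirm for your files:

    import csv

    def build_offset_index(path, key_column):
        """Map each join-key value to the byte offset of its row in the file."""
        index = {}
        with open(path, "rb") as f:
            header = next(csv.reader([f.readline().decode("utf-8")]))
            key_pos = header.index(key_column)
            offset = f.tell()
            line = f.readline()
            while line:
                row = next(csv.reader([line.decode("utf-8")]))
                index[row[key_pos]] = offset
                offset = f.tell()
                line = f.readline()
        return header, index

    def fetch_row(handle, offset):
        """Seek straight to an indexed row and parse only that one line."""
        handle.seek(offset)
        return next(csv.reader([handle.readline().decode("utf-8")]))

Keep one open handle per file while joining, so each lookup is a single seek and parse rather than a fresh file open.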
You could also pre-sort the files based on their keys and do a merge join.
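And a rough sketch of the merge join, assuming both files have already been sorted on the key externally (lexicographically, since the keys are compared as strings here) and each key appears at most once per file. For the left join you describe, you would pad with blanks instead of skipping when the right side has no match:

    import csv

    def merge_join(left_path, right_path, key, out_path):
        """Stream two key-sorted CSVs and write joined rows without loading either file."""
        with open(left_path, newline="") as lf, \
             open(right_path, newline="") as rf, \
             open(out_path, "w", newline="") as of:
            left, right, out = csv.reader(lf), csv.reader(rf), csv.writer(of)
            lhead, rhead = next(left), next(right)
            li, ri = lhead.index(key), rhead.index(key)
            # The right-hand key column is dropped since it duplicates the left one.
            out.writerow(lhead + [c for i, c in enumerate(rhead) if i != ri])
            lrow, rrow = next(left, None), next(right, None)
            while lrow is not None and rrow is not None:
                if lrow[li] == rrow[ri]:
                    out.writerow(lrow + [c for i, c in enumerate(rrow) if i != ri])
                    lrow, rrow = next(left, None), next(right, None)
                elif lrow[li] < rrow[ri]:
                    lrow = next(left, None)   # no right-side match: skip (inner-join behaviour)
                else:
                    rrow = next(right, None)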

The good news is that "several" 5 GB files is not a tremendous amount of data. I know it's relative, but from the way you describe your system... I still think it's not a big deal. If you didn't need to join, you could use Perl or a bunch of other command-line tools.
Are the column names known in each file? Do you care about the column names?
My first thoughts:
Spin up an Amazon Web Services (AWS) Elastic MapReduce (EMR) instance (even a pretty small one will work)
Upload these files
Import the files into Hive (as managed or external tables).
Perform your joins in Hive.
You can spin up an instance in a matter of minutes and be done with the work within an hour or so, depending on your comfort level with the material.
I don't work for Amazon, and can't even use their stuff during my day job, but I use it quite a bit for grad school. It works like a champ when you need your own big data cluster. Again, this isn't "Big Data (R)", but Hive will kill this for you in no time.
This article doesn't do exactly what you need (it copies data from S3); however, it will help you understand table creation, etc.
http://aws.amazon.com/articles/5249664154115844
Edit:
Here's a link to the EMR overview:
https://aws.amazon.com/elasticmapreduce/

I'm not sure if you are manipulating the data, but if you're just combining the CSVs you could try this...
http://www.solveyourtech.com/merge-csv-files/

Related

Scalable Database Plan and Server Selection for Big Amount of Data

I started working at a startup, where I am going to write the backend. It is my first time working on a project this big, and I wanted to ask a few questions to see what the best practice is. I will explain the workflow and the information first, and then ask for your valuable ideas.
The project aims to connect multiple pharmaceutical companies (at most 10) with many (at most 20,000) pharmacies. Pharmacies are supposed to upload screenshots or PDF files, and I need to gather information from these files. Every pharmacy might upload at most 100 screenshots and maybe a few PDF files, but they might do this process for different pharmaceutical companies. Let's say a pharmacy uploads 100 screenshots and 2 PDFs each for company A, company B and company C, so 300 images and 6 PDFs in total. Also, reading from a PDF or running OCR on images takes time. Every PDF will have information (transaction data) about 50 drugs. I will have a Drugs table and also a Transactions table. Every drug has on average 7 transactions. I feel like the Transactions table will be huge after some time, and it would be costly to run queries on such a big table.
Here are my questions
1) I am planning to use MySQL; would it be enough to serve my purpose?
2) Should I have a separate database for each company, or is it a better idea to keep everything in one database?
3) What is the best practice for implementing the Drugs and Transactions tables? The easiest way is just to use a foreign key, but as I said, after some time the Transactions table will be very big, so maybe there is a better way to plan it.
4) Should I go with a dedicated server or choose a service like AWS? As I said, reading from PDFs or running OCR takes time.
5) Which storage option would be reasonable for this project? Again, a dedicated server or a service like AWS storage.
6) When I read a drug's information, it will have the drug data and also around 7 transactions, so I need to write to the database 8 times per drug. Could there be a less costly option?
Thank you very much for your answers :)
Thinking of MySQL:
Thousands of rows is "tiny"; millions is "medium-sized".
A table with a billion rows has some challenges, but it is do-able.
My best advice is to plan on rewriting the entire schema and code in about 4 months. With that advice, you can hastily build something, then learn why it is not optimal; then rebuild it.
Use as many tables as you need. (It sounds like you might need a few dozen. A thousand tables would probably indicate that you are doing something wrong.)
Consider storing images, PDFs, etc. in files, not in tables. Then put the file path in a table.
Reading a PDF should be done only once, then capture the OCR'd result into a text file (or MEDIUMTEXT column). After that, you can write code (with or without SQL) to parse it and copy the important data into the database. And you can re-do that when you find that the parsing was not adequate.
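A small sketch of that layout, using Python's built-in sqlite3 purely for brevity (a MySQL driver would look much the same); the documents table, column names and directory layout are all invented for illustration:

    import sqlite3
    from pathlib import Path

    db = sqlite3.connect("pharmacy.db")
    db.execute("""CREATE TABLE IF NOT EXISTS documents (
                      id          INTEGER PRIMARY KEY,
                      pharmacy_id INTEGER NOT NULL,
                      file_path   TEXT NOT NULL,   -- the PDF/image itself stays on disk
                      ocr_text    TEXT             -- OCR output captured exactly once
                  )""")

    def store_upload(pharmacy_id, source, upload_dir):
        """Copy the uploaded file into place and record only its path."""
        dest = Path(upload_dir) / Path(source).name
        dest.write_bytes(Path(source).read_bytes())
        cur = db.execute(
            "INSERT INTO documents (pharmacy_id, file_path) VALUES (?, ?)",
            (pharmacy_id, str(dest)),
        )
        db.commit()
        return cur.lastrowid

    def save_ocr_text(document_id, text):
        """Store the OCR result once; later parsing passes re-read this text, not the PDF."""
        db.execute("UPDATE documents SET ocr_text = ? WHERE id = ?", (text, document_id))
        db.commit()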

Flat File as Input - MySQL Best Practice

I receive a flat file (CSV) each day, the contents of which get imported into my database (rather than data entry through a web form, POS or the like). There are 40 fields in a record and I'm up to 600,000 unique records.
Up until now, I haven't seen the need to make this a relational database though there certainly is some normalization that would make it more efficient; repeating products, stores, customers, resellers, etc.
If I were starting this from the beginning and incrementally inputting the data somehow, I'd know how to do all that (every resource I've gone through covers it that way, but none cover the case where you already have a large volume of data and need to make it relational). And with the CSVs coming in each day, I'm not quite sure how to import the data once the database is set up. If I were to split those 40 fields into, say, 5 tables, would I then have to split that daily file the same way and import the pieces one at a time? Would foreign keys update that way?
If someone could push me in the right direction I'll go do more digging on my own.
If you were faced with the same project, how would you create such a database and perform the daily updates?
Thanks!
Create your database structure independently of what you have right now (the CSV structure and data). That is, organize your tables to fit your future needs, think through and define the relations between them well, and apply proper indexes.
As the second step - unavoidable in my opinion - write a little program in the programming language of your choice (see the sketch after this list). It should be able to:
mainly read records/lines from a (CSV) file,
validate/sanitize the fetched data
import/save the data into the corresponding database tables, as needed. By "as needed" I mean that, over time, a multitude of factors can appear which could unexpectedly influence your first structural decisions - for example, the need for some temporary tables. You should also take advantage of triggers and stored procedures.
properly handle the errors and exceptions raised along the importing process. For example, due to possible "duplicate key" issues - because data in files can be error-prone - some records might not be importable on a given day. That doesn't mean the import should break. Read a record and try to save it; if a problem appears, handle it (copy the line to another file, or save it in a special table for later editing/review and re-import) and let the program carry on with the next records.
properly log all (main) operations and maintain counters of the read and problematic records.
automatically copy each daily file - after import - into a backup directory, until it's no longer needed.
optionally notify you by email about the status of the operations.
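A bare-bones sketch of such an importer, with sqlite3 standing in for the real database and an entirely hypothetical orders table; the point is the per-record error handling, the reject file and the counters, not the schema:

    import csv
    import logging
    import shutil
    import sqlite3

    logging.basicConfig(filename="import.log", level=logging.INFO)

    def import_daily_file(csv_path, db_path, reject_path, backup_dir):
        db = sqlite3.connect(db_path)
        read = imported = rejected = 0
        with open(csv_path, newline="") as src, open(reject_path, "w", newline="") as rej:
            reader = csv.DictReader(src)
            rejects = csv.DictWriter(rej, fieldnames=reader.fieldnames)
            rejects.writeheader()
            for row in reader:
                read += 1
                try:
                    # Validation/sanitising and the real INSERTs for your tables go here.
                    db.execute(
                        "INSERT INTO orders (order_id, store, product, qty) "
                        "VALUES (:order_id, :store, :product, :qty)",
                        row,
                    )
                    imported += 1
                except (sqlite3.Error, KeyError, ValueError) as exc:
                    # A bad record (e.g. a duplicate key) must not break the run:
                    # set it aside for later review and carry on with the next one.
                    rejects.writerow(row)
                    rejected += 1
                    logging.warning("record %d rejected: %s", read, exc)
        db.commit()
        shutil.move(csv_path, backup_dir)   # keep the processed file around as a backup
        logging.info("read=%d imported=%d rejected=%d", read, imported, rejected)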
The third step would be to automate the whole cycle - for example, find a task/cron-job scheduler to start your program daily, once or even twice a day, without you having to do it manually.
Regarding splitting the file into different files based on your database structure: it wouldn't be necessary - it would be a redundant step, since your program should manage to read the file and handle the data import accordingly.
As for the type of program: it should be a web solution, so that you can access and modify it any time you need.
Good luck.

Big quantity of data with MySql

I have a question about mass storage. Actually, I'm working with 5 sensors which each send a lot of data at a different frequency, and I'm using a MySQL database.
So here are my questions:
1) Is MySQL the perfect solution?
2) If not, is there a solution for storing this big quantity of data in a database?
3) I'm using threads for this, and I'm also using mutexes; I'm afraid this can cause problems. Actually, it seems to.
I hope I will get an answer to this question.
MySQL is a good solution for OLTP scenarios where you are storing transactions to serve web or mobile apps, but it does not scale well (despite its cluster abilities).
There are many options out there based on what is important to you:
File system: You can devise your own write-ahead-log solution to solve multi-threading problems and achieve "eventual consistency". That way you don't have to lock data for one thread at a time. You can use schema-ful formats like CSV, Avro or Parquet. You can also use S3 or WSB for cloud-based block storage, or HDFS for plain block and replicated storage.
NoSQL: You can store each entry as a document in a NoSQL document store. If you want to keep data in memory for faster reads, explore Memcached or Redis. If you want to perform searches on the data, use Solr or Elasticsearch. MongoDB is popular, but it has scalability issues similar to MySQL; instead I would choose Cassandra or HBase if you need more scalability. With some NoSQL stores you might have to parse your "documents" at read time, which may impact analytics performance.
RDBMS: As MySQL is not scalable enough, you might explore Teradata and Oracle. The latest version of Oracle offers petabyte-scale query capabilities and in-memory caching.
Using a database can add extra computational overhead if you have a "lot of data". Another question is what you do with the data: if you only stack it up, a map/vector can be enough.
The first step might be to use a map/vector that you can serialize to a file when needed; you can add the database later if you wish.
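As a rough Python illustration of that first step (the names and the flush trigger are hypothetical; the original suggestion was a C++ map/vector, but the shape is the same):

    import json
    import threading
    import time
    from collections import defaultdict

    readings = defaultdict(list)      # sensor_id -> list of (timestamp, value)
    readings_lock = threading.Lock()  # needed only because several threads append concurrently

    def record(sensor_id, value):
        with readings_lock:
            readings[sensor_id].append((time.time(), value))

    def flush(path):
        """Serialize everything collected so far to a file and start a fresh buffer."""
        with readings_lock:
            snapshot = {k: list(v) for k, v in readings.items()}
            readings.clear()
        with open(path, "w") as f:
            json.dump(snapshot, f)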
About mutexes: if you share some code between different threads and (in this code) you work on the same data at the same time, then you need them; otherwise remove them. By the way, if you can separate read and write operations, then you don't need a mutex/semaphore mechanism.
You can store data anywhere, but the choice of data storage structure depends on the use cases (the things you want to do with the data).
It could be HDFS files, RDBMS, NoSQL DB, etc.
For example, your common use cases could be:
1. to save the sensor data very quickly;
2. to get the sensor data for a specific date.
Then you can use MongoDB or Cassandra.
If you want to do deeper analytics (e.g. get monthly average sensor data), you should definitely think about other solutions.
As for MySQL, it could also be used for reasonably big data storage, as it supports sharding. It fits some scenarios well, some not.
But I repeat, it all depends on the use cases, i.e. the things you want to do with the data.
So you could update your question with more details (define the desired use cases), or ask again.
There are several Questions that discuss "lots of data" and [mysql]. They generally say "yes, but it depends on what you will do with it".
Some general statements (YMMV):
a million rows -- no problem.
a billion rows or a terabyte of data -- You will run into problems, but they are not insurmountable.
100 inserts per second on spinning disk -- probably no problem
1000 rows/second inserted can be done; troubles are surmountable
creating "reports" from huge tables is problematical until you employ Summary Tables.
Two threads storing into the same table at the "same" time? Every RDBMS (MySQL included) solves that problem before the first release. The Mutexes (or whatever) are built into the code; you don't have to worry.
"Real time" -- If you are inserting 100 sensor values per second and comparing each value to one other value: No problem. Comparing to a million other values: big problem with any system.
"5 sensors" -- Read each hour? Yawn. Each minute? Yawn. Each second? Probably still Yawn. We need more concrete numbers to help you!

Medium-term temporary tables - creating tables on the fly to last 15-30 days?

Context
I'm currently developing a tool for managing orders and communication between technicians and services. The industrial context is broadcast and TV. Multiple clients expecting media files, each made to their own specs, imply widely varying workflows even within the restricted scope of a single client's orders.
One client can ask one day for a single SD file and the next for a full-blown HD package containing up to fourteen files... In a MySQL db I am trying to store accurate information about all the small tasks composing the workflow, in multiple forms:
DATETIME values every time a task is accomplished, for accurate tracking
paths to the newly created files in the company's file system in VARCHARs
archiving background info in TEXT values (info such as user comments, e.g. when an incident happens and prevents moving forward, they can comment about it in this feed)
Multiply that by 30 different file types and this is way too much for a single table. So I thought I'd break it up by client: one table per client, so that any order only ever requires the use of that one table, which doesn't manipulate more than 15 fields. Still, this is a pretty rigid solution when a client has 9 different transcoding specs and a particular order only requires one. I figure I'd need to add a flag field for each transcoding field to indicate which ones are required for that particular order.
Concept
I then had this crazy idea that maybe I could create a temporary table to last while the order is running (that can range from about 1 day to 1 month). We rarely have more than 25 orders running simultaneously so it wouldn't get too crowded.
The idea is to make a table tailored for each order, eliminating the need for flags and unnecessary forever-empty fields. Once the order is complete, the table would get flushed, JSON-encoded, into a TEXT or BLOB so it can be restored later if changes need to be made.
Do you have experience with DBMSs (MySQL in particular) struggling under such practices, if this has ever been done? Does this sound like a viable option? I am happy to try (which I already started), and I am seeking advice on whether to keep going or stop right here.
Thanks for your input!
Well, of course that is possible to do. However, you cannot use MySQL temporary tables for such long-term storage; you will have to use "normal" tables and have some clean-up routine...
However, I do not see why that amount of data would be too much for a single table. If your queries start to run slowly due to the amount of data, then you should add some indexes to your database. I also think there is another con: it will be much harder to build reports later on; when you have 25 tables with the same kind of data, you will have to run 25 queries and merge the data.
I do not see the point, really. The same kinds of data should be in the same table.

Is there a better place to store large amounts of unused data than the database?

So the application we've got calls the APIs of all the major carriers (UPS, FedEx, etc.) for tracking data.
We save the most recent version of the XML feed we get from them in a TEXT field in a table in our database.
We really hardly ever (read, never so far) access that data, but have it "just in case."
It adds quite a bit of additional weight to the database. Right now a 200,000-row table is coming in at around 500 MB... the large majority of which is made up of all that XML data.
So is there a more efficient way to store all that XML data? I had thought about saving them as actual text/xml files, but we update the data every couple of hours, so wasn't sure if it would make sense to do that.
Assuming it's data, there's no particular reason not to keep it in your database (unless it's impeding your backups). But it would be a good idea to keep it in a separate table from the actual data that you do need to read on a regular basis: just the XML, an FK back to the original table, and possibly an autonumbered PK column.
It has been my experience that the biggest trouble with TEXT/BLOB columns that are consistently large, is that people are not careful to prevent reading them when scanning many rows. On MyISAM, this will waste your VFS cache, and on InnoDB, it will waste your InnoDB buffer pool.
A secondary problem is that as tables get bigger, they become harder to maintain: adding a column or index can rebuild the whole table, and a 500MB table rebuilds a lot slower than a 5MB table.
I've had good success moving things like this off into offline key/value storage such as MogileFS, and/or TokyoTyrant.
If you don't need to be crazy scalable, or you value transactional consistency over performance, then simply moving this column into another table with a 1:1 relationship to the original table will at least mean it takes a deliberate join to blow out the buffer pool, and will allow you to maintain the original table without having to tip-toe around the 500MB gorilla.
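For illustration, that 1:1 split might look like this (sketched with Python's sqlite3; the table and column names are invented):

    import sqlite3

    db = sqlite3.connect("shipping.db")
    db.executescript("""
        -- Hot table: small rows that everyday queries scan, with no XML attached.
        CREATE TABLE IF NOT EXISTS shipments (
            id              INTEGER PRIMARY KEY,
            tracking_number TEXT NOT NULL,
            carrier         TEXT NOT NULL,
            last_status     TEXT
        );
        -- Cold 1:1 side table: the raw feed lives here and is read only on demand.
        CREATE TABLE IF NOT EXISTS shipment_feeds (
            shipment_id INTEGER PRIMARY KEY REFERENCES shipments(id),
            raw_xml     TEXT
        );
    """)

    # Everyday queries touch only the small table...
    rows = db.execute("SELECT tracking_number, last_status FROM shipments").fetchall()

    # ...and the big column is pulled in only when someone explicitly asks for it.
    feed = db.execute(
        "SELECT raw_xml FROM shipment_feeds WHERE shipment_id = ?", (42,)
    ).fetchone()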
If it's really unused, try:
/dev/null
I don't know what kind of data these XML streams contain, but maybe you can parse them and store only the pertinent info in a table or set of tables; that way you can eliminate some of the XML's bloat.
Learn about OLAP techniques and data warehouses. They are probably what you are looking for.
As a DATAbase is designed to store DATA, this seems to be the logical place for it. A couple of suggestions:
Rather than storing it in a separate table, use a separate database, if the information isn't critical.
Have a look at the COMPRESS and UNCOMPRESS functions, as these could reduce the size of the verbose XML.
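The same effect can also be achieved client-side, compressing before the INSERT and decompressing after the SELECT; a tiny sketch with Python's zlib rather than MySQL's COMPRESS()/UNCOMPRESS() (table and column names invented):

    import sqlite3
    import zlib

    db = sqlite3.connect("tracking.db")
    db.execute("CREATE TABLE IF NOT EXISTS feeds (id INTEGER PRIMARY KEY, xml_gz BLOB)")

    def store_feed(feed_id, xml_text):
        """Verbose XML compresses well, so store the compressed bytes, not the text."""
        db.execute(
            "INSERT OR REPLACE INTO feeds (id, xml_gz) VALUES (?, ?)",
            (feed_id, zlib.compress(xml_text.encode("utf-8"))),
        )
        db.commit()

    def load_feed(feed_id):
        row = db.execute("SELECT xml_gz FROM feeds WHERE id = ?", (feed_id,)).fetchone()
        return zlib.decompress(row[0]).decode("utf-8") if row else None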
I worked on one project where we split data between the database and file system. After this experience I vowed never again. Backups and maintenance of various production/test/dev environments turned into a nightmare.
Why not store them as text files, and then keep a simple path (or relative path) in the database?
We used to do something similar in the seismic industry, where the bulk of the data were big arrays of floating-point numbers. It was much more efficient to store these as files on disk (or tape), and then only keep the trace metadata (position, etc.) in an RDBMS-like database (at about the time they were porting to Oracle!). Even with the old system, the field data was always on disk and easily accessible - it was used more frequently than the array data (although, unlike in your case, this was most definitely essential!).