Apache Camel problems aggregating large (1mil record) CSV files - csv

My questions are: (1) Is there a better strategy for solving my problem? (2) Is it possible to tweak my solution so that it works reliably and doesn't split the aggregation? (3, less important) How can I debug it more intelligently? Figuring out what the aggregator is doing is difficult because it only fails on giant batches, which are hard to debug because of their size. Answers to any of these would be very useful, most importantly the first two.
I think the problem is that I'm not correctly telling Camel to treat the incoming CSV file as a single lump; I don't want the aggregator to stop until all the records have been aggregated.
I'm writing a route to digest a million-line CSV file, split it, aggregate the data on some primary key fields, and then write the aggregated records to a table.
Unfortunately, the primary key constraints of the table (which correspond to the aggregation keys) are being violated, implying that the aggregator is not waiting for the whole input to finish.
It works fine for small files of a few thousand records, but on the sizes it will actually face in production (1,000,000 records) it fails.
First, it fails with a Java heap space error on the split after the CSV unmarshal. I fixed this with .streaming(), but that affects the aggregator, which now 'completes' too early.
To illustrate:
A 1
A 2
B 2
--- aggregator split ---
B 1
A 2
--> A(3),B(2) ... A(2),B(1) = constraint violation because 2 lots of A's etc.
when what I want is A(5),B(3)
With examples of 100, 1000, etc. records it works correctly. But when it processes 1,000,000 records, which is the real size it needs to handle, the split() first fails with a Java heap space OutOfMemoryError.
I felt that simply increasing the heap size would be a short-term solution that just pushes the problem back until the next upper limit of records comes through, so I got around it by using .streaming() on the split.
Unfortunately, the aggregator is now being drip-fed the records rather than getting them in one big lump, and it seems to be completing early and starting another aggregation, which violates my primary key constraint.
from("file://inbox")
    .unmarshal().bindy(...)
    .split().body().streaming()
    .setHeader("X", <expression building a string of the primary-key fields>)
    .aggregate(header("X"), ...).completionTimeout(15000)
    // etc.
I think part of the problem is that I'm depending on the streaming split never pausing for longer than a fixed amount of time, which just isn't foolproof: a system task might reasonably cause a longer pause, for example. Also, every time I increase this timeout it takes longer and longer to debug and test.
Probably a better solution would be to read the size of the incoming CSV file and not allow the aggregator to complete until every record has been processed. I have no idea how I'd express this in Camel, however.
Very possibly I have a fundamental misunderstanding of how I should be approaching or describing this problem. There may be a much better (simpler) approach that I don't know about.
There are also so many records going in that I can't realistically debug them by hand to get an idea of what's happening (I suspect I'm also exceeding the aggregator's timeout when I do).

You can split the file first on a line-by-line basis, and then convert each line from CSV. That way you can run the splitter in streaming mode, which keeps memory consumption low and lets you read a file with a million records.
There are some blog links on this page, http://camel.apache.org/articles, about splitting big files in Camel. They cover XML, but they are relevant to splitting big CSV files as well.
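To make the streaming idea concrete outside of Camel, here is a minimal Python sketch (not Camel code, and not the author's route). It reads the sample records from the question one row at a time, keeps only one running total per key in memory, and returns the totals only once the input is exhausted, so each key is emitted exactly once. The function name `aggregate_stream` and the space-delimited sample layout are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def aggregate_stream(lines):
    """Sum the value column per key, reading one row at a time."""
    totals = defaultdict(int)
    for key, value in csv.reader(lines, delimiter=" "):
        totals[key] += int(value)
    # Only reached once the input is exhausted, so each key appears once.
    return dict(totals)


# The five sample records from the question, in the same order.
sample = io.StringIO("A 1\nA 2\nB 2\nB 1\nA 2\n")
print(aggregate_stream(sample))  # {'A': 5, 'B': 3}
```

Memory use here grows with the number of distinct keys rather than the number of records, which is the property the streaming split is meant to give you.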

Related

Performing joins on very large data sets

I have received several CSV files that I need to merge into a single file, all with a common key I can use to join them. Unfortunately, each of these files is about 5 GB in size (several million rows, about 20-100+ columns), so it's not feasible to just load them into memory and execute a join against each one. I do know, however, that I don't have to worry about column conflicts between them.
I tried making an index of the row for each file that corresponded to each ID so I could just compute the result without using much memory, but of course that's slow as time itself when actually trying to look up each row, pull the rest of the CSV data from the row, concatenate it to the in-progress data and then write out to a file. This simply isn't feasible, even on an SSD, to process against the millions of rows in each file.
I also tried simply loading up some of the smaller sets in memory and running a parallel.foreach against them to match up the necessary data to dump back out to a temporary merged file. While this was faster than the last method, I simply don't have the memory to do this with the larger files.
I'd ideally like to just do a full left join of the largest of the files, then full left join to each subsequently smaller file so it all merges.
How else might I approach this problem? I've got 24 GB of memory and six cores to work with on this system.
While this might just be a job for loading everything into a relational database and doing the join there, I thought I'd reach out before going that route to see if there are any ideas for solving this on my local system.
Thanks!
A relational database is the first thing that comes to mind and probably the easiest, but barring that...
Build a hash table mapping key to file offset. Parse the rows on-demand as you're joining. If your keyspace is still too large to fit in available address space, you can put that in a file too. This is exactly what a database index would do (though maybe with a b-tree).
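A minimal sketch of the key-to-offset index, assuming a simple CSV with a header row, the join key in the first column, and no quoted fields that contain delimiters or newlines. The helper names `build_offset_index` and `fetch_row` are hypothetical.

```python
import os
import tempfile


def build_offset_index(path, key_col=0, delimiter=","):
    """Map each key to the byte offset at which its row starts."""
    index = {}
    with open(path, "rb") as f:
        f.readline()  # skip the header row
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.rstrip(b"\r\n").split(delimiter.encode())[key_col]
            index[key.decode()] = offset
    return index


def fetch_row(f, offset, delimiter=","):
    """Seek to a stored offset and parse just that one row."""
    f.seek(offset)
    return f.readline().decode().rstrip("\r\n").split(delimiter)


# Small demonstration file standing in for one of the 5 GB inputs.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "right.csv")
    with open(path, "w", newline="") as out:
        out.write("id,colour\n1,red\n2,green\n3,blue\n")
    index = build_offset_index(path)
    with open(path, "rb") as f:
        print(fetch_row(f, index["2"]))  # ['2', 'green']
```

Only the keys and integer offsets stay in memory; the row data is parsed on demand while joining. Lookups in key order that does not match file order will still cause random reads.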
You could also pre-sort the files based on their keys and do a merge join.
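A minimal sketch of the merge join, assuming both inputs are already sorted by their first column, keys are unique in the right-hand input, and keys compare as strings in the same order they were sorted. The function name `merge_left_join` and the `right_width` parameter (number of non-key columns on the right) are hypothetical.

```python
import csv
import io


def merge_left_join(left_rows, right_rows, right_width):
    """Left-join two row streams that are sorted by their first column."""
    right_iter = iter(right_rows)
    right = next(right_iter, None)
    for left in left_rows:
        # Advance the right side until it catches up with the left key.
        while right is not None and right[0] < left[0]:
            right = next(right_iter, None)
        if right is not None and right[0] == left[0]:
            yield left + right[1:]
        else:
            # No match: pad with empty cells, as a left join would.
            yield left + [""] * right_width


left = csv.reader(io.StringIO("1,a\n2,b\n4,d\n"))
right = csv.reader(io.StringIO("1,x\n3,y\n4,z\n"))
for row in merge_left_join(left, right, right_width=1):
    print(row)
# ['1', 'a', 'x']
# ['2', 'b', '']
# ['4', 'd', 'z']
```

Each input is read once, front to back, so memory use stays constant regardless of file size. The sort itself has to happen beforehand, for example with an external sort.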
The good news is that "several" 5GB files is not a tremendous amount of data. I know it's relative, but the way you describe your system...I still think it's not a big deal. If you weren't needing to join, you could use Perl or a bunch of other command-liney tools.
Are the column names known in each file? Do you care about the column names?
My first thoughts:
Spin up an Amazon Web Services (AWS) Elastic MapReduce (EMR) instance (even a pretty small one will work)
Upload these files
Import the files into Hive (as managed or not).
Perform your joins in Hive.
You can spin up an instance in a matter of minutes and be done with the work within an hour or so, depending on your comfort level with the material.
I don't work for Amazon, and can't even use their stuff during my day job, but I use it quite a bit for grad school. It works like a champ when you need your own big data cluster. Again, this isn't "Big Data (R)", but Hive will kill this for you in no time.
This article doesn't do exactly what you need (it copies data from S3); however, it will help you understand table creation, etc.
http://aws.amazon.com/articles/5249664154115844
Edit:
Here's a link to the EMR overview:
https://aws.amazon.com/elasticmapreduce/
I'm not sure if you are manipulating the data. But if just combining the csv's you could try this...
http://www.solveyourtech.com/merge-csv-files/

Performance Issues with Include in Entity Framework

I am working on a large application built with the Repository Pattern, Web APIs, and AngularJS. In one scenario, I am trying to retrieve the data for a single lead that has relations with approximately 20 tables. Lazy loading is disabled, so I am using Include to get the data from all 20 tables. Here is the performance issue: retrieving a single record takes approximately 15 seconds. I am returning JSON, and my entities are decorated with DataContract(IsReference = true) / DataMember attributes.
Any suggestions will be highly appreciated.
Include is really nasty for performance because of how it joins.
See more info in my blog post here http://mikee.se/Archive.aspx/Details/entity_framework_pitfalls,_include_20140101
To summarize the problem a bit: EF handles Include by joining. This creates a result set where every row includes every column of every joined entity (some of which contain null values).
This is even nastier if the root entity contains large fields (like a long text or a binary), because those get repeated in every row.
15s is way too much though. I suspect something more is at play like missing indexes.
To summarize the solutions: my suggestion is normally that you load every relation separately or in a multiquery. A simple query like that should take 5-30ms per entity depending on your setup. In this case it would still be quite slow (~1s if you are querying on indexes). Maybe you need to look at some way to store this data in a better format if this query is run often (cache, document, JSON in the db). I can't help you with that though; I would need far more information, as the update paths affect the possibilities a lot.
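The effect of joining several child collections in one result set can be illustrated with plain SQL. The sketch below uses SQLite from Python; it is not the SQL that EF generates, and the table names (`leads`, `phone`, `email`, `task`) are hypothetical. It only compares how many rows come back from one joined query versus one query per relation.

```python
import sqlite3


def row_counts(children_per_table=10):
    """Return (rows from one big join, rows from separate queries)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE leads (id INTEGER PRIMARY KEY, notes TEXT);
        CREATE TABLE phone (id INTEGER PRIMARY KEY, lead_id INTEGER, val TEXT);
        CREATE TABLE email (id INTEGER PRIMARY KEY, lead_id INTEGER, val TEXT);
        CREATE TABLE task  (id INTEGER PRIMARY KEY, lead_id INTEGER, val TEXT);
    """)
    conn.execute("INSERT INTO leads VALUES (1, 'a long text field')")
    for table in ("phone", "email", "task"):
        conn.executemany(
            f"INSERT INTO {table} (lead_id, val) VALUES (1, ?)",
            [(str(i),) for i in range(children_per_table)],
        )
    # One query joining every relation: child rows multiply each other,
    # and the root's columns are repeated on every row.
    joined = conn.execute("""
        SELECT COUNT(*) FROM leads l
        LEFT JOIN phone p ON p.lead_id = l.id
        LEFT JOIN email e ON e.lead_id = l.id
        LEFT JOIN task  t ON t.lead_id = l.id
        WHERE l.id = 1
    """).fetchone()[0]
    # One query per relation: row counts simply add up.
    separate = 1 + sum(
        conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE lead_id = 1"
        ).fetchone()[0]
        for table in ("phone", "email", "task")
    )
    conn.close()
    return joined, separate


print(row_counts())  # (1000, 31)
```

With three relations of ten rows each, the joined query returns 1000 rows while the separate queries return 31 in total, which is why loading relations individually tends to scale better as the number of relations grows.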
The performance has been improved by Enabling Lazy Loading.

mysql getting rid of redundant values

I am creating a database to store data from a monitoring system that I have created. The system takes a bunch of data points(~4000) a couple times every minute and stores them in my database. I need to be able to down sample based on the time stamp. Right now I am planning on using one table with three columns:
results:
1. point_id
2. timestamp
3. value
So the query I'd like to run would be:
SELECT point_id,
MAX(value) AS value
FROM results
WHERE timestamp BETWEEN date1 AND date2
GROUP BY point_id;
The problem I am running into is this seems super inefficient with respect to memory. Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me. The only solutions I thought of that reduce the memory footprint of my database requires me to either use separate tables (which to my understanding is super bad practice) or storing the data in CSV files which would require me to write my own code to search through the data (which to my understanding requires me not to be a bum... and probably search substantially slower). Is there a database structure that I could implement that doesn't require me to store so much duplicate data?
A database with your data structure is going to be less efficient than custom code. Guess what? That is not unusual.
First, though, I think you should wait until this is actually a performance problem. A timestamp with no fractional seconds requires 4 bytes (see here). So, a record would have, say 4+4+8=16 bytes (assuming a double floating point representation for value). By removing the timestamp you would get 12 bytes -- savings of 25%. I'm not saying that is unimportant. I am saying that other considerations -- such as getting the code to work -- might be more important.
Based on your data, the difference is between 184 Mbytes/day and 138 Mbytes/day, or 67 Gbytes/year and 50 Gbytes. You know, you are going to have to deal with biggish data issues regardless of how you store the timestamp.
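The figures above can be reproduced with a few lines of arithmetic. This sketch assumes "a couple times every minute" means two samples per minute, and uses the byte sizes from the answer (4-byte point_id, 4-byte timestamp, 8-byte value); the constant and function names are hypothetical.

```python
POINTS = 4000            # data points per sample, from the question
SAMPLES_PER_MINUTE = 2   # assumption: "a couple times every minute"
MINUTES_PER_DAY = 60 * 24


def bytes_per_day(row_bytes):
    """Raw row storage per day, ignoring index and page overhead."""
    return POINTS * SAMPLES_PER_MINUTE * MINUTES_PER_DAY * row_bytes


with_ts = bytes_per_day(4 + 4 + 8)   # point_id + timestamp + value
without_ts = bytes_per_day(4 + 8)    # point_id + value

print(f"{with_ts / 1e6:.0f} vs {without_ts / 1e6:.0f} Mbytes/day")
print(f"{with_ts * 365 / 1e9:.0f} vs {without_ts * 365 / 1e9:.0f} Gbytes/year")
# 184 vs 138 Mbytes/day
# 67 vs 50 Gbytes/year
```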
Keeping the timestamp in the data will allow you other optimizations, notably the use of partitions to store each day in a separate file. This should be a big benefit for your queries, assuming the where conditions are partition-compatible. (Learn about partitioning here.) You may also need indexes, although partitions should be sufficient for your particular query example.
The point of SQL is not that it is the most optimal way to solve any given problem. Instead, it offers a reasonable solution to a very wide range of problems, and it offers many different capabilities that would be difficult to implement individually. So, the time to a reasonable solution is much, much less than developing bespoke code.
Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me.
Not really. Date values are not that big and storing the same value for each row is perfectly reasonable.
...use separate tables (which to my understanding is super bad practice)
Who told you that!!! Normalising data (splitting into separate, linked data structures) is actually a good practise - so long as you don't overdo it - and SQL is designed to perform well with relational tables. It would be perfectly fine to create a "time" table and link to the data in the other table. It would use a little more memory, but that really shouldn't concern you unless you are working in a very limited memory environment.

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
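A short sketch of the difference, using Python's standard `hashlib`. A SHA-1 digest is 20 raw bytes, which is 40 characters when written as hex; the 48-characters-per-ID figure is taken from the question as given. The helper name `dataset_size_mib` is hypothetical.

```python
import hashlib

digest_hex = hashlib.sha1(b"example record").hexdigest()
digest_raw = bytes.fromhex(digest_hex)   # the same ID as raw bytes
print(len(digest_hex), len(digest_raw))  # 40 20


def dataset_size_mib(n_ids, bytes_per_id):
    """Size of a list of IDs in MiB, ignoring any storage overhead."""
    return n_ids * bytes_per_id / 2**20


print(dataset_size_mib(500_000, 48))  # text IDs at 48 characters each
print(dataset_size_mib(500_000, 20))  # raw 20-byte digests
```

The conversion is lossless: `digest_raw.hex()` gives back the original hex string whenever the ID needs to be sent to Solr.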
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
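The first option can be sketched as follows. This uses SQLite from Python purely for illustration (the answer has MySQL in mind), stores each ID as a 20-byte blob, and relies on `WITHOUT ROWID` so that rows are physically ordered by the composite primary key. The helper names `save_search`, `load_search`, and `expire_search` are hypothetical.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE saved_searches (
        search_id INTEGER NOT NULL,
        item_id   BLOB    NOT NULL,   -- 20 raw SHA-1 bytes
        PRIMARY KEY (search_id, item_id)
    ) WITHOUT ROWID
""")


def save_search(conn, search_id, item_ids):
    """Insert one dataset, with its keys in sorted order."""
    conn.executemany(
        "INSERT INTO saved_searches VALUES (?, ?)",
        ((search_id, item_id) for item_id in sorted(item_ids)),
    )


def load_search(conn, search_id):
    """Return every item ID belonging to one saved dataset."""
    rows = conn.execute(
        "SELECT item_id FROM saved_searches WHERE search_id = ? "
        "ORDER BY item_id",
        (search_id,),
    )
    return [row[0] for row in rows]


def expire_search(conn, search_id):
    """Delete a whole dataset in one statement."""
    conn.execute("DELETE FROM saved_searches WHERE search_id = ?", (search_id,))


ids = [hashlib.sha1(str(i).encode()).digest() for i in range(5)]
save_search(conn, 1, ids)
print(len(load_search(conn, 1)))  # 5
expire_search(conn, 1)
print(len(load_search(conn, 1)))  # 0
```

Because `search_id` leads the primary key, both loading and expiring a dataset touch one contiguous range of the table.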

Basic question: Querying data and performance tradeoffs

Let's say I have 100 rows in my table, with 3 columns of numbers. I don't need all the rows, only about half of them every time I fetch data. I only want the rows that have updated as getting the rest would be redundant.
Is it better to add a datetime field recording when each row was last updated (and use that as a criterion when SELECTing)? Or would it be better to simply download all the data each and every time (currently the data is being sent back as a JSON file)?
What are the tradeoffs in terms of speed, bandwidth usage, and server cpu usage between these two options? Is the former just plain better than the latter?
Both Jens Struwe and roycl are right - but as you're asking a hypothetical question, you're going to get answers that are right and contradictory.
If only half the data is relevant, how is the client going to determine which data to show? If the decision can be made by software at all, it's more efficient to do it on the database - but it's also more logical.
With tables of 100 rows, performance is neither here nor there; maintainability and long-term upgradability are a far bigger deal. Most developers would expect a logical database design, and sorting/filtering to be done on the DB rather than the client.
Always (or at least whenever possible) select only the data that you need to accomplish your task. Conversely, never select data that you then have to filter out. As a result: add a timestamp field for the updates and select only those rows whose timestamp is later than the given one.
With 100 rows in your table and 3 columns of numbers, it really doesn't matter which approach you use, as long as you are happy with the server returning the data within a few tens of milliseconds. The rows, if queried frequently, will all be in memory anyway. It also makes your JSON code simpler and your client code dumber (which is probably good, and more maintainable).
If you had a several-million row table with only a small percentage of data that was required, you would naturally want to limit the return set, and the easiest way of doing that is with an SQL WHERE clause, such as WHERE dt_modified > my_timestamp. On a properly optimised database even this query could come in at well under 100ms.
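A minimal sketch of the timestamp filter, using SQLite from Python for illustration. The table name `readings`, the integer timestamps, and the helper `rows_modified_since` are assumptions; only the `dt_modified` column name and the `WHERE dt_modified > ...` condition come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        id INTEGER PRIMARY KEY,
        a INTEGER, b INTEGER, c INTEGER,
        dt_modified INTEGER NOT NULL
    )
""")
# An index lets the filter avoid scanning every row on large tables.
conn.execute("CREATE INDEX idx_readings_modified ON readings (dt_modified)")
conn.executemany(
    "INSERT INTO readings (a, b, c, dt_modified) VALUES (?, ?, ?, ?)",
    [(i, i * 2, i * 3, 1000 + i) for i in range(100)],
)


def rows_modified_since(conn, last_fetch):
    """Return only the rows changed after the client's last fetch."""
    return conn.execute(
        "SELECT id, a, b, c, dt_modified FROM readings "
        "WHERE dt_modified > ? ORDER BY dt_modified",
        (last_fetch,),
    ).fetchall()


changed = rows_modified_since(conn, 1049)
print(len(changed))    # 50
print(changed[-1][4])  # 1099, the value to remember for the next fetch
```

The client keeps the largest `dt_modified` it has seen and passes it back on the next request, so each fetch returns only what changed in between.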
The issue may have more to do with the time the data spends "on the wire" and how much time the client spends either regenerating the page or updating it based on the returned data. Client processing time is often the slowest part of the process. Only testing on different browsers and over different network speeds will find the best balance between server-side tweaks, network fixes (such as gzipping to compress data) and optimising your JavaScript calls.