How to index a 1 billion row CSV file with Elasticsearch?

Imagine you had a large CSV file - let's say 1 billion rows.
You want each row in the file to become a document in Elasticsearch.
You can't load the file into memory - it's too large, so it has to be streamed or chunked.
The time taken is not a problem. The priority is making sure ALL data gets indexed, with no missing data.
What do you think of this approach:
Part 1: Prepare the data
Loop over the CSV file in batches of 1k rows
For each batch, transform the rows into JSON and save them into a smaller file
You now have 1m files, each with 1000 lines of nice JSON
The filenames should be incrementing IDs. For example, running from 1.json to 1000000.json
Part 2: Upload the data
Start looping over each JSON file and reading it into memory
Use the bulk API to upload 1k documents at a time
Record the success/failure of the upload in a result array
Loop over the result array and, if any upload failed, retry. (A rough sketch of this two-part plan follows below.)
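Roughly, in Python, using the official elasticsearch client - the index name, file paths, and connection details here are placeholders:

```python
import csv
import json
import os

from elasticsearch import Elasticsearch, helpers

BATCH_SIZE = 1000      # rows per intermediate JSON file
CHUNK_DIR = "chunks"   # hypothetical output directory
INDEX = "my_index"     # hypothetical index name


# Part 1: stream the CSV and write each 1k-row batch as its own JSON file.
def prepare(csv_path):
    os.makedirs(CHUNK_DIR, exist_ok=True)
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)   # streams row by row, never loads the whole file
        batch, file_id = [], 1
        for row in reader:
            batch.append(row)
            if len(batch) == BATCH_SIZE:
                write_chunk(batch, file_id)
                batch, file_id = [], file_id + 1
        if batch:                    # final partial batch
            write_chunk(batch, file_id)


def write_chunk(rows, file_id):
    with open(os.path.join(CHUNK_DIR, f"{file_id}.json"), "w") as out:
        for row in rows:
            out.write(json.dumps(row) + "\n")   # one JSON document per line


def upload_chunk(es, name):
    """Bulk-index one chunk file; returns True if every document succeeded."""
    with open(os.path.join(CHUNK_DIR, name)) as f:
        actions = [{"_index": INDEX, "_source": json.loads(line)} for line in f]
    ok, errors = helpers.bulk(es, actions, raise_on_error=False)
    return not errors


# Part 2: upload every chunk, record the results, then retry any failures.
def upload():
    es = Elasticsearch("http://localhost:9200")  # placeholder address
    names = sorted(os.listdir(CHUNK_DIR), key=lambda n: int(n.split(".")[0]))
    results = {name: upload_chunk(es, name) for name in names}
    for name, succeeded in results.items():
        if not succeeded:
            results[name] = upload_chunk(es, name)   # single retry pass


if __name__ == "__main__":
    prepare("huge.csv")
    upload()
```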

The steps you've mentioned above look good. A couple of other things will help make sure ES does not come under too much load:
From what I've experienced, you can increase the bulk request size to a larger value as well, say somewhere in the range of 4k-7k documents (start with 7k and, if that causes problems, experiment with smaller batches; going lower than 4k probably won't be needed).
Ensure refresh_interval is set to a fairly large value. This keeps the index from refreshing too frequently while documents are pouring in. IMO the default value will also do.
As the above comment suggests, it'd be better to start with a smaller batch of data. Of course, if you use constants instead of hardcoding the values, your task just got easier.
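For those two tweaks, a hedged sketch - the index name and connection are placeholders, and the exact keyword for the settings call varies between client versions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder connection
INDEX = "my_index"                            # placeholder index name
BULK_SIZE = 5000                              # somewhere in the suggested 4k-7k range

# Raise refresh_interval for the duration of the load ("-1" would disable
# refreshes entirely). Depending on client version the argument is body= or settings=.
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "60s"}})

# ... run the bulk upload here, BULK_SIZE documents per request ...

# Put the refresh interval back to something normal once the load is done.
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "1s"}})
```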

Related

Large JSON Storage

Summary
What is the "best practice" way to store large JSON arrays on a remote web service?
Background
I've got a service, "service A", that generates JSON objects, an "item", no larger than 1KiB. Every time it emits an item, the item needs to be appended to a JSON array. Later, a user can get all these arrays of items, which can be 10s of MiB or more.
Performance
What is the best way to store JSON to make appending and retrieval performant? Ideally, insertion would be O(1) and retrieval would be fast enough that we didn't need to tell the user to wait until their files have downloaded.
The downloads have never become so large that the constraint is the time to download them from the server (if they were a 10 MiB file). The constraint has always been the time to compute the file.
Stack
Our current stack is Django + PostgreSQL running on Elastic Beanstalk. New services are acceptable (e.g. S3, if append were supported).
Attempted Solutions
When we try to store all JSON in a single row in the database, performance is understandably slow.
When we try to store each JSON object in a separate row, it takes too long to aggregate the separate rows into a single array of items. In addition, a user requests all item arrays in their account every time they visit the main screen of the app, so it is inefficient to recompute the aggregated array of items each time.
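For concreteness, here is a minimal sketch of what that one-item-per-row layout might look like, with the aggregation pushed down into Postgres via jsonb_agg; the items table, its columns, and the connection string are all assumptions, not the poster's actual schema:

```python
import psycopg2
from psycopg2.extras import Json

# Hypothetical schema: items(id BIGSERIAL PRIMARY KEY, array_id BIGINT, payload JSONB)
conn = psycopg2.connect("dbname=app")   # placeholder connection string

def append_item(array_id, item):
    """Each emitted ~1 KiB item becomes its own row, so appends stay cheap."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO items (array_id, payload) VALUES (%s, %s)",
            (array_id, Json(item)),
        )

def fetch_array(array_id):
    """Ask Postgres to build the JSON array, rather than aggregating rows in Python."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT coalesce(jsonb_agg(payload ORDER BY id), '[]'::jsonb)"
            " FROM items WHERE array_id = %s",
            (array_id,),
        )
        return cur.fetchone()[0]
```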

Performing joins on very large data sets

I have received several CSV files that I need to merge into a single file, all with a common key I can use to join them. Unfortunately, each of these files is about 5 GB in size (several million rows, about 20-100+ columns), so it's not feasible to just load them into memory and execute a join against each one, but I do know that I don't have to worry about column conflicts between them.
I tried making an index of the row offset in each file that corresponds to each ID, so I could compute the result without using much memory, but of course that's slow as time itself when actually looking up each row, pulling the rest of the CSV data from the row, concatenating it to the in-progress data, and then writing it out to a file. Even on an SSD, this simply isn't feasible against the millions of rows in each file.
I also tried simply loading some of the smaller sets into memory and running a Parallel.ForEach against them to match up the necessary data and dump it back out to a temporary merged file. While this was faster than the last method, I simply don't have the memory to do this with the larger files.
Ideally I'd like to do a full left join of the largest of the files, then a full left join to each subsequently smaller file so it all merges.
How might I otherwise go about approaching this problem? I've got 24 GB of memory and six cores on this system to work with.
While this might just be a problem to load up in a relational database and do the join there from, I thought I'd reach out before going that route to see if there are any ideas out there on solving this from my local system.
Thanks!
A relational database is the first thing that comes to mind and probably the easiest, but barring that...
Build a hash table mapping key to file offset. Parse the rows on-demand as you're joining. If your keyspace is still too large to fit in available address space, you can put that in a file too. This is exactly what a database index would do (though maybe with a b-tree).
You could also pre-sort the files based on their keys and do a merge join.
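A rough Python sketch of the key-to-offset variant; it assumes comma-separated files with a header row and no embedded newlines in quoted fields, and all names are placeholders:

```python
import csv

def build_offset_index(path, key_column):
    """Map each join key to the byte offset of its row; the rows themselves stay on disk."""
    index = {}
    with open(path, newline="") as f:
        header = next(csv.reader([f.readline()]))
        key_pos = header.index(key_column)
        offset = f.tell()
        for line in iter(f.readline, ""):
            if line.strip():
                index[next(csv.reader([line]))[key_pos]] = offset
            offset = f.tell()
    return header, index

def left_join(big_path, small_path, out_path, key_column):
    """Stream the largest file and pull matching rows from the smaller one on demand."""
    small_header, small_index = build_offset_index(small_path, key_column)
    extra_cols = [c for c in small_header if c != key_column]
    with open(big_path, newline="") as big, \
         open(small_path, newline="") as small, \
         open(out_path, "w", newline="") as out:
        reader = csv.DictReader(big)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames + extra_cols)
        writer.writeheader()
        for row in reader:
            offset = small_index.get(row[key_column])
            if offset is not None:
                small.seek(offset)
                values = next(csv.reader([small.readline()]))
                row.update({c: v for c, v in zip(small_header, values) if c != key_column})
            writer.writerow(row)   # unmatched rows keep empty extra columns (left join)
```

Repeat left_join once per smaller file, feeding the previous output in as the new "big" file.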
The good news is that "several" 5GB files is not a tremendous amount of data. I know it's relative, but the way you describe your system...I still think it's not a big deal. If you weren't needing to join, you could use Perl or a bunch of other command-liney tools.
Are the column names known in each file? Do you care about the column names?
My first thoughts:
Spin up an Amazon Web Services (AWS) Elastic MapReduce (EMR) instance (even a pretty small one will work)
Upload these files
Import the files into Hive (as managed or external tables).
Perform your joins in Hive.
You can spin up an instance in a matter of minutes and be done with the work within an hour or so, depending on your comfort level with the material.
I don't work for Amazon, and can't even use their stuff during my day job, but I use it quite a bit for grad school. It works like a champ when you need your own big data cluster. Again, this isn't "Big Data (R)", but Hive will kill this for you in no time.
This article doesn't do exactly what you need (it copies data from S3); however, it will help you understand table creation, etc.
http://aws.amazon.com/articles/5249664154115844
Edit:
Here's a link to the EMR overview:
https://aws.amazon.com/elasticmapreduce/
I'm not sure if you are manipulating the data, but if you're just combining the CSVs you could try this...
http://www.solveyourtech.com/merge-csv-files/

MySQL Blob vs. Disk for "video frames"

I have a C++ app that generates 6x relatively small image-like integer arrays per second. The data is 64x48x2-dimensional int (i.e., a grid of 64x48 two-dimensional vectors, with each vector consisting of two floats). That works out to ~26 KB per image. The app also generates a timestamp and some features describing the data. I want to store the timestamp and the features in a MySQL db column, per frame. I also need to store the original array as binary data, either in a file on disk or as a blob field in the database. Assume that the app will be running more or less nonstop, and that I'll come up with a way to archive data older than a certain age, so that storage does not become a problem.
What are the tradeoffs here for blobs, files-on-disc, or other methods I may not even be thinking of? I don't need to query against the binary data, but I need to query against the other metadata/features in the table (I'll definitely have an index built against timestamp), and retrieve the binary data. Does the equation change if I store multiple frames in a single file on disk, vs. one frame per file?
Yes, I've read MySQL Binary Storage using BLOB VS OS File System: large files, large quantities, large problems and To Do or Not to Do: Store Images in a Database, but I think my question differs because in this case there are going to be millions of identically-dimensioned binary files. I'm not sure how the performance hit to maintaining that many small files in a filesystem compares to storing that many files in db blob columns. Any perspective would be appreciated.
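To make the files-on-disk option concrete, here is a hedged sketch (in Python rather than C++ for brevity); the table schema, directory layout, and MySQL driver are all assumptions:

```python
import os
import time

import mysql.connector  # any MySQL driver would do; this one is just an example

conn = mysql.connector.connect(user="app", database="frames")  # placeholder credentials

# Hypothetical schema:
#   CREATE TABLE frames (
#       id BIGINT AUTO_INCREMENT PRIMARY KEY,
#       ts DOUBLE NOT NULL,          -- indexed timestamp
#       features JSON,               -- or one column per feature
#       path VARCHAR(255) NOT NULL,  -- where the raw array lives on disk
#       INDEX (ts)
#   );

DATA_ROOT = "/var/frames"  # placeholder root directory

def store_frame(ts, features_json, raw_bytes):
    """Metadata goes into MySQL; the ~26 KB array goes to a file, bucketed by hour."""
    bucket = time.strftime("%Y%m%d%H", time.gmtime(ts))   # one directory per hour
    os.makedirs(os.path.join(DATA_ROOT, bucket), exist_ok=True)
    path = os.path.join(DATA_ROOT, bucket, f"{ts:.3f}.bin")
    with open(path, "wb") as f:
        f.write(raw_bytes)
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO frames (ts, features, path) VALUES (%s, %s, %s)",
        (ts, features_json, path),
    )
    conn.commit()
```

The blob variant is the same metadata row with a LONGBLOB column in place of the path; grouping multiple frames per file mainly changes how the path plus an offset is recorded.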
At a certain point, querying for many blobs becomes unbearably slow. I suspect that even with your identically dimensioned binary files this will be the case. Moreover, you will still need some code to access and process the blobs. And this doesn't take advantage of the file caching that might speed up image queries straight from the file system.
But! The link you provided did not mention object-based databases, which can store the data you described in a way that lets you access it extremely quickly, and possibly return it in native format. For a discussion, see the link below or just search Google; there are many discussions:
Storing images in NoSQL stores
I would also look into HBase.
I figured that since you were not sure what to use in the first place (and there were no answers), an alternative solution might be appropriate.

Checking for Duplicate Files without Storing their Checksums

For instance, you have an application which processes files that are sent by different clients. The clients send tons of files every day and you load the content of those files into your system. The files all have the same format. The only constraint you are given is that you are not allowed to run the same file twice.
One way to check whether you have already run a particular file is to create a checksum of the file and store it in another file. So when you get a new file, you can create its checksum and compare it against the checksums of the other files that you have run and stored.
Now, the file that contains all the checksums of all the files that you have run so far is getting really, really huge. Searching and comparing is taking too much time.
NOTE: The application uses flat files as its database. Please do not suggest using an RDBMS or the like. It is simply not possible at the moment.
Do you think there could be another way to check the duplicate files?
Keep them in different places: have one directory where the client(s) upload files for processing, have another where those files are stored.
Or are you in a situation where the client can upload the same file multiple times? If that's the case, then you pretty much have to do a full comparison each time.
And checksums, while they give you confidence that two files are different (and, depending on the checksum, a very high confidence), are not 100% guaranteed. You simply can't take a practically infinite universe of possible multi-byte streams, reduce them to a 32-byte checksum, and be guaranteed uniqueness.
Also: consider a layered directory structure. For example, a file foobar.txt would be stored using the path /f/fo/foobar.txt. This will minimize the cost of scanning directories (a linear operation) for the specific file.
And if you retain checksums, this can be used for your layering: /1/21/321/myfile.txt (using least-significant digits for the structure; the checksum in this case might be 87654321).
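A small sketch of that checksum-based layering; the root directory is a placeholder, and how the checksum is computed is left to you:

```python
import os
import shutil

STORE_ROOT = "/data/processed"   # placeholder root directory

def layered_path(filename, checksum):
    """/1/21/321/<name>, built from the checksum's least-significant digits."""
    return os.path.join(STORE_ROOT, checksum[-1], checksum[-2:], checksum[-3:], filename)

def store(src_path, checksum):
    """Move a processed file into its checksum-derived directory."""
    dest = layered_path(os.path.basename(src_path), checksum)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(src_path, dest)
    return dest

# e.g. layered_path("myfile.txt", "87654321") -> /data/processed/1/21/321/myfile.txt
```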
Nope. You need to compare all files. Strictly, you need to compare the contents of each new file against all already-seen files. You can approximate this with a checksum or hash function, but should you find a new file already listed in your index, you then need to do a full comparison to be sure, since hashes and checksums can have collisions.
So it comes down to how to store the file more efficiently.
I'd recommend you leave it to professional software such as Berkeley DB, memcached, Voldemort, or the like.
If you must roll your own, you could look at the principles behind binary searching (qsort, bsearch, etc.).
If you maintain the list of seen checksums (and the path to the full file, for that double-check I mentioned above) in sorted form, you can search for it using a binary search. However, the cost of inserting each new item in the correct order becomes increasingly expensive.
One mitigation for a large number of hashes is to bin-sort your hashes, e.g. have 256 bins corresponding to the first byte of the hash. You obviously only have to search and insert in the list of hashes that start with that byte code, and you omit the first byte from storage.
If you are managing hundreds of millions of hashes (in each bin), then you might consider a two-phase sort such that you have a main list for each bin and then a 'recent' list; once the recent list reaches some threshold, say 100,000 items, you merge it into the main list (O(n)) and reset the recent list.
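A compact sketch of the binned, sorted-list idea; SHA-256, the bin directory, and rewriting each bin on insert are simplifications for illustration rather than a tuned implementation:

```python
import bisect
import hashlib
import os

BIN_DIR = "hash_bins"   # placeholder: one file per leading byte, e.g. hash_bins/a3

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def seen_before(path):
    """Binary-search the bin for this hash; insert it in sorted order if it is new."""
    digest = file_hash(path)
    bin_name, rest = digest[:2], digest[2:]   # bin on the first byte, store the remainder
    bin_path = os.path.join(BIN_DIR, bin_name)
    os.makedirs(BIN_DIR, exist_ok=True)
    hashes = []
    if os.path.exists(bin_path):
        with open(bin_path) as f:
            hashes = f.read().splitlines()    # kept sorted on disk
    i = bisect.bisect_left(hashes, rest)
    if i < len(hashes) and hashes[i] == rest:
        return True                           # duplicate candidate; verify contents to be sure
    hashes.insert(i, rest)
    with open(bin_path, "w") as f:
        f.write("\n".join(hashes) + "\n")
    return False
```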
You need to compare any new document against all previous documents; the efficient way to do that is with hashes.
But you don't have to store all the hashes in a single unordered list, nor does the next step up have to be a full database. Instead, you can have directories based on the first digit or two of the hash, then files based on the next two digits, with those files containing sorted lists of hashes. (Or any similar scheme - you can even make it adaptive, increasing the levels when the files get too big.)
That way, searching for matches involves a couple of directory lookups followed by a binary search in a file.
If you get lots of quick repeats (the same file submitted at the same time), then a look-aside cache might also be worth having.
I think you're going to have to redesign the system, if I understand your situation and requirements correctly.
Just to clarify, I'm working on the basis that clients send you files throughout the day, with filenames that we can assume are irrelevant, and when you receive a file you need to ensure its contents are not the same as another file's contents.
In which case, you do need to compare every file against every other file. That's not really avoidable, and you're doing about the best you can manage at the moment. At the very least, asking for a way to avoid the checksum is asking the wrong question - you have to compare an incoming file against the entire corpus of files already processed today, and comparing the checksums is going to be much faster than comparing entire file bodies (not to mention the memory requirements for the latter...).
However, perhaps you can speed up the checking somewhat. If you store the already-processed checksums in something like a trie, it should be a lot quicker to see if a given file (rather, checksum) has already been processed. For a 32-character hash, you'd need to do a maximum of 32 lookups to see if that file had already been processed rather than comparing with potentially every other file. It's effectively a binary search of the existing checksums rather than a linear search.
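A minimal in-memory illustration of the checksum trie; persistence, which a real system would need, is left out:

```python
class ChecksumTrie:
    """One node per hash character; membership costs at most len(checksum) lookups."""

    def __init__(self):
        self.root = {}

    def add(self, checksum):
        node = self.root
        for ch in checksum:
            node = node.setdefault(ch, {})
        node["$"] = True   # marks the end of a complete checksum

    def __contains__(self, checksum):
        node = self.root
        for ch in checksum:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

# Hypothetical usage:
#   seen = ChecksumTrie()
#   if digest in seen: ...duplicate candidate, do the full comparison...
#   else: seen.add(digest)
```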
You should at the very least move the checksums file into a proper database file (assuming it isn't already) - although SQLExpress with its 4GB limit might not be enough here. Then, along with each checksum store the filename, file size and date received, add indexes to file size and checksum, and run your query against only the checksums of files with an identical size.
But as Will says, your method of checking for duplicates isn't guaranteed anyway.
Despite you asking not to suggest an RDBMS, I will still suggest SQLite - if you store all checksums in one table with an index, searches will be quite fast, and integrating SQLite is not a problem at all.
As Will pointed out in his longer answer, you should not store all hashes in a single large file, but simply split them up into several files.
Let's say the alphanumeric-formatted hash is pIqxc9WI. You store that hash in a file named pI_hashes.db (based on the first two characters).
When a new file comes in, calculate the hash, take the first two characters, and only do the lookup in the corresponding *_hashes.db file.
After creating a checksum, create a directory with the checksum as the name and then put the file in there. If there are already files in there, compare your new file with the existing ones.
That way, you only have to check one (or a few) files.
I also suggest adding a header (a single line) to the file which explains what's inside: the date it was created, the IP address of the client, some business keys. The header should be selected in such a way that you can detect duplicates by reading this single line.
[EDIT] Some file systems bog down when you have a directory with many entries (in this case: the checksum directories). If this is an issue for you, create a second layer by using the first two characters of the checksum as the name of the parent directory. Repeat as necessary.
Don't cut off the two characters from the next level; this way, you can easily find files by checksum if something goes wrong without cutting checksums manually.
As mentioned by others, having a different data structure for storing the checksums is the correct way to go. Anyway, although you have mentioned that you don't want to go the RDBMS way, why not try SQLite? You can use it like a file, and it is lightning fast. It is also very simple to use - most languages have SQLite support built in, too. It will take you less than 40 lines of code in, say, Python.
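For the SQLite suggestions above, the whole thing really is only a handful of lines; the database filename and schema here are assumptions:

```python
import sqlite3

conn = sqlite3.connect("checksums.db")   # placeholder filename
conn.execute("CREATE TABLE IF NOT EXISTS seen (checksum TEXT PRIMARY KEY)")

def is_duplicate(digest):
    """PRIMARY KEY gives an index, so both the lookup and the insert stay fast."""
    row = conn.execute("SELECT 1 FROM seen WHERE checksum = ?", (digest,)).fetchone()
    if row:
        return True
    conn.execute("INSERT INTO seen (checksum) VALUES (?)", (digest,))
    conn.commit()
    return False
```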

Using Memcache as a counter for multiple objects

I have a photo-hosting website, and I want to keep track of views to the photos. Due to the large volume of traffic I get, incrementing a column in MySQL on every hit incurs too much overhead.
I currently have a system implemented using Memcache, but it's pretty much just a hack.
Every time a photo is viewed, I increment its photo-hits_uuid key in Memcache. In addition, I add a row containing the uuid to an invalidation array also stored in Memcache. Every so often I fetch the invalidation array, and then cycle through the rows in it, pushing the photo hits to MySQL and decrementing their Memcache keys.
This approach works and is significantly faster than directly using MySQL, but is there a better way?
I did some research and it looks like Redis might be my solution. It seems like it's essentially Memcache with more functionality - the most valuable part to me is its list support, which pretty much solves my problem.
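A hedged sketch of what the Redis version might look like, assuming redis-py; the key names and the flush hook are made up:

```python
import redis

r = redis.Redis()   # placeholder connection

def record_view(photo_uuid):
    r.incr(f"photo-hits_{photo_uuid}")
    r.sadd("dirty-photos", photo_uuid)   # plays the role of the invalidation array

def flush_to_mysql(save_hits):
    """Run periodically; save_hits(uuid, count) is whatever pushes the count to MySQL."""
    while True:
        member = r.spop("dirty-photos")
        if member is None:
            break
        uuid = member.decode()
        count = int(r.getset(f"photo-hits_{uuid}", 0) or 0)   # atomically read and reset
        if count:
            save_hits(uuid, count)
```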
There is a way that I use.
Method 1: (Size of a file)
Every time someone hits the page, I append one more byte to a file. Then after x seconds or so (I use 600), I count how many bytes are in the file, delete the file, and then push the count to the MySQL database. This also allows scalability if multiple servers are appending to a small file on a cache server. Use fwrite to append to the file and you will never have to read that cache file.
Method 2: (Number stored in a file)
Another method is to store a number in a text file that contains the number of hits, but I recommend against using this because if two processes were updating it simultaneously, the data might be off (maybe the same with method 1).
I would use method 1 because, although the file grows bigger, it is faster; a rough sketch follows below.
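A sketch of method 1, shown in Python even though the original mentions fwrite; the counter path and the MySQL hook are placeholders, and the small window between counting and deleting is the kind of inaccuracy the answer already admits to:

```python
import os

COUNTER_FILE = "/tmp/photo_hits.bin"   # placeholder path

def record_hit():
    # Append a single byte per hit; appends are cheap and the file is never read back.
    with open(COUNTER_FILE, "ab") as f:
        f.write(b"\0")

def flush(save_count):
    """Run every ~600 seconds; save_count(n) is whatever updates MySQL."""
    try:
        hits = os.path.getsize(COUNTER_FILE)   # one byte == one hit
    except FileNotFoundError:
        return
    os.remove(COUNTER_FILE)                    # hits appended in between are lost
    if hits:
        save_count(hits)
```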
I'm assuming you're keeping access logs on your server for this solution; a rough sketch follows the steps below.
Keep track of the last time you checked your logs.
Every n seconds or so (where n is less than the time it takes for your logs to be rotated, if they are), scan through the latest log file, ignoring every hit until you find a timestamp after your last check time.
Count how many times each image was accessed.
Add each count to the count stored in the database.
Store the timestamp of the last log entry you processed for next time.
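Sketched out, that loop might look roughly like this; the log path, the line parser, and the database hook are all placeholders, and timestamps are assumed to sort correctly as strings (e.g. ISO format):

```python
import collections
import json
import os

STATE_FILE = "last_check.json"            # remembers the last timestamp we processed
LOG_FILE = "/var/log/nginx/access.log"    # placeholder log path

def scan(parse_line, add_counts):
    """parse_line(line) -> (timestamp, image_id) or None; add_counts(dict) writes to the DB."""
    last_ts = ""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            last_ts = json.load(f)["last_ts"]
    counts = collections.Counter()
    newest = last_ts
    with open(LOG_FILE) as f:
        for line in f:
            parsed = parse_line(line)
            if not parsed:
                continue
            ts, image_id = parsed
            if ts <= last_ts:          # already counted on a previous pass
                continue
            counts[image_id] += 1
            newest = max(newest, ts)
    if counts:
        add_counts(counts)             # e.g. UPDATE photos SET hits = hits + n
    with open(STATE_FILE, "w") as f:
        json.dump({"last_ts": newest}, f)
```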