How can I sort a very large CSV file?

I have this large 294,000-row CSV with URLs in column 1 and numbers in column 2.
I need to sort them from the smallest number to the largest. I have loaded it into the software 'CSVed' and it handles it okay, it doesn't crash or anything, but when I click the top of the column to sort it, the rows don't come out in order from smallest to largest; it's all just muddled up.
Anyone have any ideas? I've been searching around all day, I thought I might ask here.
Thanks.

If you have access to a Unix system (and your URLs don't have commas in them) this should do the trick:
sort -t',' -n -k2 filename
Where -t says columns are delimited by commas, -n says the data is numeric, and -k2 says to sort based on the second column.
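If your file has a header row you want to keep at the top, a small variation (just a sketch, assuming the same comma-delimited file) keeps the first line aside and sorts the rest:
(head -n 1 filename && tail -n +2 filename | sort -t',' -n -k2) > sorted.csv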

You can use GNU sort. It has a small memory footprint and can even use multiple CPUs for sorting.
sort -t , -k 2n file.csv
GNU sort is available by default in most Linux distributions; macOS also ships a sort by default, though it is the BSD variant with slightly different options. You can install GNU sort on Windows as well, for example from the CoreUtils for Windows page.
For more information about sort invocation, see the manual (man sort).
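If you do use the GNU version, two of its extra options can help on a file this size (a sketch; -S sets the in-memory buffer size and --parallel the number of threads, both GNU-only flags):
sort -t , -k 2,2n -S 512M --parallel=4 file.csv > sorted.csv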

Related

Speed up an algorithm using PHP for large textual data and files

There are two tables as below:
document table - this table contains the path of the file which actually holds the HTML content, and also has a column for hierarchy
find and replace - this table contains the word to find and the text to replace it with (the replacement string can be a link or HTML itself); the remaining fields are comma-separated IDs (document IDs from table 1) which tell which word is to be replaced in which document
In short, this process allows the user to find and replace keywords based on the second table, and only in the documents required.
The algorithm works as below:
Get the count of all records in the documents table.
Break them into sets of 100 records (to reduce server timeouts).
Loop over each set of 100; for each record, use the document ID and hierarchy number to get the list of keywords and the content they should be replaced with in this particular document (note: the WHERE condition runs on a comma-separated string).
Fetch the file from the server using the path in the first table and extract the HTML content.
Loop over each keyword in sequence and replace it in the content with the required replacement from the second table.
Create the final file and save it on the server.
The process works fine and gives desired results too.
The problem begins when the data increases. As of now, there are around 50,000 entries in the first table and thus the same number of files on the server.
The second table contains around 15,000 find-and-replace records, each with a long comma-separated string of document IDs.
With this amount of data, the process runs for days, and that should not happen.
The database is MySQL 5.5, the backend is PHP (Laravel 5.4), and the OS is CentOS 7 with an nginx web server.
Is there a way to make this process smooth and less time-consuming? Any help is appreciated.
PHP has a function shell_exec($shellCommand);
You may wish to use the GNU/Linux shell-accessible program called sed (stream editor) to do this substitution rather than slurping each file into PHP and then writing it out again.
For example,
$result = shell_exec("cd what/ever/directory; sed 's/this/that/g' inputfile > outputfile");
will read what/ever/directory/inputfile, change all the this strings to that, and write the result into what/ever/directory/outputfile. And, it will do it very quickly compared to php.
Edit: Why does this approach save a lot of time?
Shell programs like sed have been around for decades and are highly optimized. sed uses far less processing power, far fewer CPU cycles, than PHP to do what it does, so the transformation of the files is faster.
The task of editing a file requires reading, transforming, and writing it. Doing this operation the way you describe requires each of those phases to finish before the next one can start. sed, on the other hand, is a stream editor: it reads, transforms, and writes in a single streaming pass, so the phases overlap rather than run one after another.
To get the most out of this approach, you'll need to get your php program to write more complex editing commands than 's/this/that/g'. You'll want to do multiple substitutions in a single sed run. You can do that by concatenating editing instructions like this example:
's/this/that/; s/blue/azul/g; s/red/rojo/g'
A single shell command can be around 100K characters in length, so you probably won't hit limits on the length of those editing instructions.
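If the generated command ever approaches that limit, one alternative (a sketch; edits.sed is a hypothetical file your PHP code would write out) is to put the instructions in a sed script and pass it with -f:
cat > edits.sed <<'EOF'
s/this/that/g
s/blue/azul/g
s/red/rojo/g
EOF
sed -f edits.sed inputfile > outputfile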
By suggesting the use of sed I am, in effect, suggesting a different algorithm.

How to count the number of IDs in a file

So I have a huge file containing hundreds of thousands of lines. I want to know how many different sessions or IDs it contains. I really thought it wouldn't be that hard to do, but I'm unable to find a way.
Sessions look like this:
"session":"1425654508277"
So there will be a few thousand lines with that session, then it will switch to another one, not necessarily incrementing by one; I don't know the pattern, if there is one. So I just want to know how many sessions appear in the document, i.e. how many are different from each other (they SHOULD be consecutive, but that's not a requirement, just something I noticed).
Is there an easy way to do this? Only things I've found even remotely close are excel macros and scripts, which lead me to think I'm not asking the right questions. I also found this: Notepad++ incrementally replace but it does not help in my case.
Thanks in advance.
Consider using jq. You can extract session with [.session], then apply unique, then length.
https://stedolan.github.io/jq/manual/
I am no jq expert, and have not tested this, but it seems that the program
unique_by(.session) | length
might give you what you want.
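For reference, an untested invocation along those lines, assuming the file is one JSON object per line and that the field really is called session (sessions.log is a placeholder for your file name):
jq -s '[.[].session] | unique | length' sessions.log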
According to your profile, you know JavaScript, so you can use that:
Load the file.
Look for session. (If this is JSON, this could be as simple as myJson['session'].)
Keyed on session value, add to a map, e.g. myCounts[sessionValue] = doesNotMatter.
Count the number of keys in the map.
There are easier ways, like torazaburo's suggestion to use cat data | uniq | wc, but it doesn't sound like you want to learn Unix, so you may as well practice your JavaScript (I do this myself when learning programming languages: use it for everything).
You won't be able to achieve this with Notepad++, but you can use a Linux shell command, i.e.:
cat sessions.txt | uniq | wc
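Note that uniq only collapses adjacent duplicate lines, and each line also contains data other than the session, so a more robust variant (a sketch, assuming the "session":"..." pattern shown in the question) extracts and sorts the values before counting:
grep -o '"session":"[0-9]*"' sessions.txt | sort -u | wc -l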
Adding to my own question: if you manage to get the strings you want separated into columns in Excel, Excel has a Filter option which automatically gives you the different values you can filter a column by.
Applied to my case, this means that if I get the key-value pairs ("session":"idSession", the 100,000 values each in a row) all into one column, then filter and count manually, I get the number of different values.
I didn't get to try the wc/Unix option because I found this while trying to apply the other method.

What is generally faster, grepping through files or running a SQL LIKE %x% query through blobs?

Say I'm designing a tool that would save code snippets either in a PostgreSQL/MySQL database or on the file system. I want to search through these snippets. Using a search engine like Sphinx doesn't seem practical because we need exact text matches when searching code.
grep and ack have always worked great, but storing stuff in a database makes a large collection more manageable in certain ways. I'm wondering what the relative performance of running grep recursively over a tree of directories is, compared to running a query like SQL's LIKE or MySQL's REGEXP function over an equivalent number of records with TEXT blobs.
If you have 1M files to grep through, you will (as best I'm aware) have to go through each one with a regular expression.
For all intents and purposes, you're going to end up doing the same thing over table rows if you mass-query them using a LIKE operator or a regular expression.
My own experience with grep is that I seldom look for something that doesn't contain at least one full word, however, so you might be able to take advantage of a database to reduce the set in which you're searching.
MySQL has native full text search features, but I'd recommend against them because they mean you're not using InnoDB.
You can read about those from Postgres here:
http://www.postgresql.org/docs/current/static/textsearch.html
After creating an index on a tsvector column, you can then do your "grep" in two steps, one to immediately find rows that might vaguely qualify, followed by another on your true criteria:
select * from docs where tsvcol @@ :tsquery and (regexp at will);
That will be significantly faster than anything grep can do.
I can't compare them, but both will take long. My guess is grep will be faster.
But MySQL supports full text indexing and searching, which will be faster than grep (I guess again).
Also, I did not understand what the problem with Sphinx or Lucene is. Anyway, here's a benchmark for MySQL, Sphinx and Lucene.
The internet seems to think that grep uses Boyer-Moore, which would make the query time depend additively (not multiplicatively) on the query size. This isn't that relevant though.
I think it's near-optimal for a one-time search. But in your case you can do better, since you have repeated searches whose structure you can exploit (e.g. by indexing certain common substrings in your query), as bpgergo hints at.
Also, I'm not sure the regular expression engine you are thinking of using is optimized for a non-special query; you could try it and see.
You may wish to keep all the files you're searching through in memory to avoid harddisk-based slowdown. This should work unless you are searching a staggering amount of text.
If you want a full-text index on code, I would recommend Russ Cox's codesearch tools
https://code.google.com/p/codesearch/
This is How Google Code Search Worked
http://swtch.com/~rsc/regexp/regexp4.html
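For what it's worth, a rough usage sketch of those tools (command names as provided by the codesearch project; the path is made up): cindex builds a trigram index once, and csearch then runs regexp queries against it:
cindex /path/to/snippets   # build or refresh the trigram index
csearch 'preg_replace\('   # regexp search over everything cindex has indexed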

Checking for Duplicate Files without Storing their Checksums

For instance, you have an application which processes files that are sent by different clients. The clients send tons of files everyday and you load the content of those files into your system. The files have the same format. The only constraint that you are given is you are not allowed to run the same file twice.
The way to check whether you have already run a particular file is to create a checksum of the file and store it in another file. So when you get a new file, you can create its checksum and compare it against the checksums of the other files that you have run and stored.
Now, the file that contains all the checksums of all the files that you have run so far is getting really, really huge. Searching and comparing is taking too much time.
NOTE: The application uses flat files as its database. Please do not suggest using an RDBMS or the like. It is simply not possible at the moment.
Do you think there could be another way to check the duplicate files?
Keep them in different places: have one directory where the client(s) upload files for processing, have another where those files are stored.
Or are you in a situation where the client can upload the same file multiple times? If that's the case, then you pretty much have to do a full comparison each time.
And checksums, while they give you confidence that two files are different (and, depending on the checksum, a very high confidence), are not 100% guaranteed. You simply can't take a practically-infinite universe of possible multi-byte streams and reduce them to a 32 byte checksum, and be guaranteed uniqueness.
Also: consider a layered directory structure. For example, a file foobar.txt would be stored using the path /f/fo/foobar.txt. This will minimize the cost of scanning directories (a linear operation) for the specific file.
And if you retain checksums, this can be used for your layering: /1/21/321/myfile.txt (using least-significant digits for the structure; the checksum in this case might be 87654321).
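A minimal shell sketch of that checksum-based layering (md5sum, incoming.txt and the store/ root are just assumptions for illustration):
sum=$(md5sum incoming.txt | awk '{print $1}')
dir="store/${sum: -1}/${sum: -2}/${sum: -3}"   # least-significant characters, matching the 87654321 example
mkdir -p "$dir"
cp incoming.txt "$dir/"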
Nope. You need to compare all files. Strictly, you need to compare the contents of each new file against all already-seen files. You can approximate this with a checksum or hash function, but should you find a new file already listed in your index, you then need to do a full comparison to be sure, since hashes and checksums can have collisions.
So it comes down to how to store the file more efficiently.
I'd recommend you leave it to professional software such as Berkeley DB, memcached, Voldemort or the like.
If you must roll your own you could look at the principles behind binary searching (qsort, bsearch etc).
If you maintain the list of seen checksums (and the path to the full file, for that double-check I mentioned above) in sorted form, you can search for it using a binary search. However, the cost of inserting each new item in the correct order becomes increasingly expensive.
One mitigation for a large number of hashes is to bin-sort your hashes e.g. have 256 bins corresponding to the first byte of the hash. You obviously only have to search and insert in the list of hashes that start with that byte-code, and you omit the first byte from storage.
If you are managing hundreds of millions of hashes (in each bin), then you might consider a two-phase sort such that you have a main list for each hash and then a 'recent' list; once the recent list reaches some threshold, say 100000 items, then you do a merge into the main list (O(n)) and reset the recent list.
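A sketch of that two-phase scheme with GNU sort (main.txt and recent.txt are hypothetical file names): append new hashes to recent.txt, and once it passes the threshold, merge:
sort -o recent.txt recent.txt                                        # sort the small 'recent' list
sort -m main.txt recent.txt > merged.txt && mv merged.txt main.txt   # O(n) merge of two sorted lists
: > recent.txt                                                       # reset the recent list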
You need to compare any new document against all previous documents; the efficient way to do that is with hashes.
But you don't have to store all the hashes in a single unordered list, nor does the next step up have to be a full database. Instead you can have directories based on the first digit, or 2 digits of the hash, then files based on the next 2 digits, and those files containing sorted lists of hashes. (Or any similar scheme - you can even make it adaptive, increasing the levels when the files get too big)
That way searching for matches involves, a couple of directory lookups, followed by a binary search in a file.
If you get lots of quick repeats (the same file submitted at the same time), then a Look-aside cache might also be worth having.
I think you're going to have to redesign the system, if I understand your situation and requirements correctly.
Just to clarify, I'm working on the basis that clients send you files throughout the day, with filenames that we can assume are irrelevant, and when you receive a file you need to ensure its contents are not the same as another file's contents.
In which case, you do need to compare every file against every other file. That's not really avoidable, and you're doing about the best you can manage at the moment. At the very least, asking for a way to avoid the checksum is asking the wrong question - you have to compare an incoming file against the entire corpus of files already processed today, and comparing the checksums is going to be much faster than comparing entire file bodies (not to mention the memory requirements for the latter...).
However, perhaps you can speed up the checking somewhat. If you store the already-processed checksums in something like a trie, it should be a lot quicker to see if a given file (rather, checksum) has already been processed. For a 32-character hash, you'd need to do a maximum of 32 lookups to see if that file had already been processed rather than comparing with potentially every other file. It's effectively a binary search of the existing checksums rather than a linear search.
You should at the very least move the checksums file into a proper database file (assuming it isn't already) - although SQLExpress with its 4GB limit might not be enough here. Then, along with each checksum store the filename, file size and date received, add indexes to file size and checksum, and run your query against only the checksums of files with an identical size.
But as Will says, your method of checking for duplicates isn't guaranteed anyway.
Despite you asking not to suggest an RDBMS, I will still suggest SQLite - if you store all checksums in one table with an index, searches will be quite fast, and integrating SQLite is not a problem at all.
As Will pointed out in his longer answer, you should not store all hashes in a single large file, but simply split them up into several files.
Let's say the alphanumeric-formatted hash is pIqxc9WI. You store that hash in a file named pI_hashes.db (based on the first two characters).
When a new file comes in, calculate the hash, take the first two characters, and only do the lookup in the corresponding CHARS_hashes.db file.
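A shell-flavoured sketch of that lookup (md5sum, the hashes/ directory and the $newfile variable are assumptions for illustration):
hash=$(md5sum "$newfile" | awk '{print $1}')
bin="hashes/${hash:0:2}_hashes.db"
if grep -qxF "$hash" "$bin" 2>/dev/null; then
    echo "already processed: $newfile"
else
    echo "$hash" >> "$bin"
    # ... process the file here ...
fi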
After creating a checksum, create a directory with the checksum as the name and then put the file in there. If there are already files in there, compare your new file with the existing ones.
That way, you only have to check one (or a few) files.
I also suggest adding a header (a single line) to the file which explains what's inside: the date it was created, the IP address of the client, some business keys. The header should be selected in such a way that you can detect duplicates by reading this single line.
[EDIT] Some file systems bog down when you have a directory with many entries (in this case: the checksum directories). If this is an issue for you, create a second layer by using the first two characters of the checksum as the name of the parent directory. Repeat as necessary.
Don't cut off the two characters from the next level; this way, you can easily find files by checksum if something goes wrong without cutting checksums manually.
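A minimal sketch of that layout (sha1sum, the processed/ root and the $incoming variable are placeholders), including the full comparison against anything already in the checksum directory:
sum=$(sha1sum "$incoming" | awk '{print $1}')
dir="processed/${sum:0:2}/$sum"
mkdir -p "$dir"
for existing in "$dir"/*; do
    [ -e "$existing" ] || continue            # directory was empty
    cmp -s "$incoming" "$existing" && { echo "duplicate of $existing"; exit 0; }
done
cp "$incoming" "$dir/"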
As mentioned by others, having a different data structure for storing the checksums is the correct way to go. Anyway, although you have mentioned that you don't want to go the RDBMS way, why not try SQLite? You can use it like a file, and it is lightning fast. It is also very simple to use - most languages have SQLite support built in, too. It will take you less than 40 lines of code in, say, Python.

How to measure percent difference in a codebase?

Task at hand: I have three versions of some code, developed by different coders, one "parent" and two "children", and I need to calculate which one is closer to the parent.
The size of the code at hand prohibits manually counting diffs, and I failed to see any aggregate similarity stats in the popular diff/merge tools I've tried.
How do I acquire a single percent "similarity" number?
Thanks.
You could count the lines of the diff. On Linux you would do:
diff -r parent child1 | wc -l
diff -r parent child2 | wc -l
This way you get a rough difference in lines of code.
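If you want to turn that into a single rough percentage (a sketch; it assumes plain text files under parent/ and child1/ and simply counts the parent lines that differ):
total=$(find parent -type f -exec cat {} + | wc -l)
changed=$(diff -r parent child1 | grep -c '^<')
echo "scale=1; 100 * $changed / $total" | bc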
Perhaps you can use a Copy-Paste detector tool such as http://pmd.sourceforge.net/cpd.html. I haven't used it personally but it seems to be able to generate statistics.