I'd like to write a script that traverses a file tree, calculates a hash for each file, and inserts the hash into an SQL table together with the file path, such that I can then query and search for files that are identical.
What would be the recommended hash function or command-line tool to create hashes that are extremely unlikely to be identical for different files?
I've been working on this problem for much too long. I'm on my third (and hopefully final) rewrite.
Generally speaking, I recommend SHA1: collisions are far harder to come by than with MD5 (where collisions can be generated in minutes), and SHA1 doesn't tend to be a bottleneck when reading from hard disks. If you're obsessed with getting your program to run fast against a solid-state drive, either go with MD5 or spend days and days figuring out how to parallelize the operation. In any case, do not parallelize the hashing until your program does everything you need it to do.
Also, I recommend using sqlite3. When I made my program store file hashes in a PostgreSQL database, the database insertions were a real bottleneck. Granted, I could have tried using COPY (I forget if I did or not), and I'm guessing that would have been reasonably fast.
If you use sqlite3 and perform the insertions inside a BEGIN/COMMIT block, you're probably looking at about 10,000 insertions per second even with indexes. However, what you can do with the resulting database makes it all worthwhile. I did this with about 750,000 files (85 GB). The whole insert-and-SHA1-hash operation took less than an hour and produced a 140 MB sqlite3 file, and my query to find duplicate files and sort them by ID takes less than 20 seconds to run.
In summary, using a database is good, but note the insertion overhead. SHA1 is safer than MD5, but takes about 2.5x as much CPU power. However, I/O tends to be the bottleneck (CPU is a close second), so using MD5 instead of SHA1 really won't save you much time.
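Since the question doesn't name a language, here is a rough sketch in Python of the pipeline described above: walk the tree, hash each file with SHA1 in chunks, do all the inserts inside one transaction, then group identical files with a single query. The starting path, table name, and column names are made up for the illustration.

import hashlib
import os
import sqlite3

def sha1_of(path, bufsize=1 << 20):
    # Hash in chunks so large files never have to fit in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

conn = sqlite3.connect("hashes.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, hash TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_hash ON files(hash)")

root = "/home/admin"  # made-up starting directory

# One transaction around all the inserts (the BEGIN/COMMIT advice above).
with conn:
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            conn.execute("INSERT OR REPLACE INTO files (path, hash) VALUES (?, ?)",
                         (p, sha1_of(p)))

# Every hash that occurs more than once is a group of identical files.
for h, cnt in conn.execute("SELECT hash, COUNT(*) FROM files GROUP BY hash HAVING COUNT(*) > 1"):
    print(cnt, h)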
You can use an MD5 or SHA1 hash:
function process_dir($path) {
    if ($handle = opendir($path)) {
        while (false !== ($file = readdir($handle))) {
            if ($file != "." && $file != "..") {
                if (is_dir($path . "/" . $file)) {
                    // recurse into subdirectories
                    process_dir($path . "/" . $file);
                } else {
                    // change md5_file to sha1_file for SHA1;
                    // this is where you would put the hash into the database
                    $hash = md5_file($path . "/" . $file);
                }
            }
        }
        closedir($handle);
    }
}
If you're working on Windows, change the slashes to backslashes.
Here's a solution I figured out. I didn't do all of it in PHP, though it'd be easy enough to if you wanted:
$fh = popen('find /home/admin -type f -print0 | xargs -0 sha1sum', 'r'); // -print0/-0 copes with spaces in file names
$files = array();
while ($line = fgets($fh)) {
    // sha1sum separates the hash and the path with two spaces
    list($hash, $file) = preg_split('/\s+/', trim($line), 2);
    $files[$hash][] = $file;
}
pclose($fh);
$dupes = array_filter($files, function($a) { return count($a) > 1; });
I realise I've not used databases here. How many files are you going to be indexing? Do you need to put that data into a database and then search for the dupes there?
I have the latest Tcl build from ActiveState installed on a desktop and a laptop, both running Windows 10. I'm new to Tcl and a novice developer, and my reason for learning Tcl is to enhance my value on the F5 platform. I figured a good first step would be to stop doing the occasional work I do in VBScript and port it to Tcl. Learning the language itself is coming along all right, but I'm worried my project isn't viable due to performance: my VBScripts absolutely destroy my Tcl scripts. I didn't expect that outcome, as my understanding was that Tcl was so "fast" and that's why it was chosen by F5 for iRules etc.
So the question is, am I doing something wrong? Is the port for Windows just not quite there? Or perhaps I misunderstood the way in which Tcl is fast, and it's not fast for file-parsing applications?
My test application is a firewall log parser: take a log with 6 million hits, find the unique src/dst/port/policy entries, and count them, split into accept and deny. Opening the file and reading the lines is fine; Tcl processes 18k lines/second while VBScript does 11k. As soon as I do anything with the data, the tide turns. I need to break the four pieces of data noted above out of each line and put them in an array. I've "split" the line and done a for-next loop to read and match each part of the line; that's the slowest. I've done a regexp with subvariables that extracts all four elements in a single pass, and that's much faster, but it's twice as slow as doing four regexps with a single variable each and then cleaning the excess data away from each match with trims. But even this method is four times slower than VBScript with ad-hoc splits/for-next matching and trims. On my desktop, I get 7k lines/second with Tcl and 25k with VBScript.
Then there's the array. I assume that because my 3-dimensional array isn't a real array, searching through 3x as many entries is slowing it down. I may try to break up the array so it's looking through a third of the data it does currently. But the truth is, by the time the script gets to the point where there are a couple hundred entries in the array, it's dropped from processing 7k lines/second to less than 2k, while my VBScript drops from about 25k lines/second to 22k. So I don't see much hope.
I guess what I'm looking for in an answer, from those with Tcl experience and general programming experience, is: is Tcl natively slower than VBScript and similar scripting languages for what I'm doing? Is it the Windows port that's slowing it down? What kinds of applications is Tcl "fast" at or good at? If I need to try a different kind of project than reading and manipulating data from files, I'm open to that.
Edited to add code examples as requested:
while { [gets $infile line] >= 0 } {
    # (some other commands cut out for the sake of space; they don't contribute to the slowness)

    regexp {srcip=(.*)srcport.*dstip=(.*)dstport=(.*)dstint.*policyid=(.*)dstcount} $line -> srcip dstip dstport policyid

    # the above was unexpectedly slow; the fastest way to extract the data I've found so far:
    regexp {srcip=(.*)srcport} $line srcip
    set srcip [string trim $srcip "cdiloprsty="]
    regexp {dstip=(.*)dstport} $line dstip
    set dstip [string trim $dstip "cdiloprsty="]
    regexp {dstport=(.*)dstint} $line dstport
    set dstport [string trim $dstport "cdiloprsty="]
    regexp {policyid=(.*)dstcount} $line a policyid
    set policyid [string trim $policyid "cdiloprsty="]
Here is the array search that really bogs down after a while:
set start [array startsearch uList]
while {[array anymore uList $start]} {
    incr f
    # "key" returns the NAME of the element and uList(key) the VALUE associated with that name
    set key [array nextelement uList $start]
    if {$uCheck == $uList($key)} {
        ##puts "$key CONDITION MET"
        set flag true
        adduList $uCheck $key $flag2
        set flag2 false
        break
    }
}
Your question is still a bit broad in scope.
F5 has published some comments on why they chose Tcl and how it is fast for their specific use cases. That is actually quite different from a log-parsing use case: they do all the heavy lifting in C code (via custom commands) and use Tcl mostly as a fast dispatcher and for a bit of flow control, and Tcl is really good at that compared to various other languages.
For things like log parsing, Tcl is often beaten in performance by languages like Python and Perl in simple benchmarks. There are a variety of reasons for that; here are some of them:
Tcl uses a different regexp style (DFA-based), which is more robust against nasty patterns but slower for simple ones.
Tcl has a more abstract I/O layer than, for example, Python, and usually converts the input to Unicode, which adds some overhead if you do not disable it (via fconfigure).
Tcl has proper multithreading instead of a global lock, which costs around 10-20% performance for single-threaded use cases.
So how to get your code fast(er)?
Try a more specific regular expression; those greedy .* patterns are bad for performance (see the sketch after this list).
Try string commands instead of regexp; a string first followed by string range can be faster than a regexp for these simple patterns.
Use a different structure for that array; you probably want either a dict or some form of nested list.
Put your code inside a proc rather than at the top level of the script, and use local variables instead of globals, so it gets compiled to faster bytecode.
If you want, use one thread for reading lines from the file and multiple threads for extracting the data, in a typical producer-consumer pattern.
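To make the first suggestion concrete, here is a small sketch of "specific versus greedy" capture groups. It's written in Python purely to keep the example compact; the sample line is invented, but the greedy pattern mirrors the one in the question, and Tcl's regexp benefits from the same kind of tightening.

import re

# Invented sample line in the spirit of the question's log format (an assumption,
# not the OP's real data).
line = ("srcip=10.0.0.1 srcport=1234 dstip=10.0.0.2 dstport=443 "
        "dstint=port1 policyid=42 dstcount=1")

# Greedy captures, like the pattern in the question: every .* first runs to the end
# of the line and backtracks, and the captured text still needs trimming afterwards.
greedy = re.compile(
    r"srcip=(.*)srcport.*dstip=(.*)dstport=(.*)dstint.*policyid=(.*)dstcount")

# More specific captures: \S+ stops at the first space, so there is far less
# backtracking and no trim step afterwards.
specific = re.compile(
    r"srcip=(\S+) srcport=\S+ dstip=(\S+) dstport=(\S+) dstint\S* policyid=(\S+) dstcount")

print(greedy.search(line).groups())    # ('10.0.0.1 ', '10.0.0.2 ', '443 ', '42 ')
print(specific.search(line).groups())  # ('10.0.0.1', '10.0.0.2', '443', '42')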
I wrote some code like this (I'm using MySQL, PDO, InnoDB, Laravel 4, localhost & a Mac):
$all_queue = Queue1::all()->toArray(); // count: about 10,000
ob_end_clean();
foreach ($all_queue as $key => $value) {
    $priceCreate = array(...);
    Price::create($priceCreate);
    Queue1::where('id', $value['id'])->delete();
}
This worked for me (about 65 MB of RAM usage), but while it was running, other parts of my program (such as other tables) didn't work. I couldn't even open my database in MySQL. My program and MySQL wait, and only when the process is complete do they work again.
I don't know what I'm supposed to do.
I think this isn't a Laravel issue; it's my PHP or MySQL configuration.
This is my php.ini and MySQL config.
I assume
$all_foreach($all_queue as $key=>$value) {
Is
foreach($all_queue as $key=>$value) {
And that you have no errors (you have debug set to true in your app config).
Try to set no time limit for your script.
In your php.ini:
max_execution_time = 3600 ; this is one hour; set to 0 for no limit
Or in code:
set_time_limit(0);
And if it's a memory problem, try to free memory and unset unused variables. It's good practice in long-running scripts to free space.
    ...
} // end of foreach loop
unset($all_queue); // no longer needed, so unset it to free memory
I've run into a performance issue with PowerShell and generating parameterized queries. I'm pretty sure I'm going about it wrong, and hopefully someone can help.
Here is the problem: inserting 1,000 rows (with about 15 columns each) takes around 3 minutes when using parameterized queries. Using non-parameterized batch inserts takes less than a second. The problem seems to lie in how I'm generating the parameters, as that is the part of the code that eats most of the time; the looping is killing the performance. It wouldn't be a big deal if all I had to deal with were 1,000 rows; however, there are millions.
Here is what I have so far. I've omitted quite a bit for the sake of brevity, but don't get me wrong: it certainly works. It's just painfully slow.
$values = @() # some big array of values
$sqlCommand.CommandText = "INSERT INTO ``my_db``.``$table`` (dataseries_code,dataseries_text) VALUES (?,?),(?,?)" # abbreviated for the sake of sanity
for ($i = 0; $i -lt $values.Count; $i++) {
    $sqlCommand.Parameters.AddWithValue($i, $values[$i]) # this is the slow part
}
$insertedRows = $sqlCommand.ExecuteNonQuery()
Write-Host "$insertedRows"
$sqlCommand.Parameters.Clear()
There must be a better way to generate the .Parameters.AddWithValue calls? I took a look at the .AddRange method but couldn't figure out how to make it work, or whether it was even intended for what I'm trying to do.
EDIT:
I should have mentioned that I also tried creating the parameters first and then adding the values. It took 3 minutes to create the parameters and another 3 minutes to add the values (see the example below): 6 minutes total for 30,000 rows! Something is not right.
I can't help but think that there has to be a faster method to do parameterized inserts?!
# THIS PIECE WILL ONLY RUN ONCE
# this is where we create the parameters first
for ($i = 0; $i -lt @($values).Count; $i++) {
    $command.Parameters.Add((New-Object MySql.Data.MySqlClient.MySqlParameter($i, [MySql.Data.MySqlClient.MySqlDbType]::VarChar))) | Out-Null
}

# THIS PIECE WILL RUN MULTIPLE TIMES
# and this is where we add the values
for ($i = 0; $i -lt @($values).Count; $i++) {
    $command.Parameters[$i].Value = $values[$i]
}
What about building the command object and adding the parameters once, then looping through the values?
Instead of clearing the parameters each time, you could just change the value. That would avoid a lot of object creation/disposal.
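For what it's worth, the shape of that pattern (define the statement once, rebind only the values on each pass) looks like this in a quick Python sketch; the table, column names, and sample rows are made up, and sqlite3 is used only so the example runs anywhere.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database just for the sketch
conn.execute("CREATE TABLE t (dataseries_code TEXT, dataseries_text TEXT)")

rows = [("A", "alpha"), ("B", "beta"), ("C", "gamma")]  # stand-in for the real data
sql = "INSERT INTO t (dataseries_code, dataseries_text) VALUES (?, ?)"

# One statement, many bindings: only the parameter values change per row, which is
# the same idea as keeping the command's Parameters and reassigning .Value in a loop.
with conn:  # single transaction around the batch
    conn.executemany(sql, rows)

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])

In the PowerShell/.NET version that corresponds to what's described above: keep the command and its Parameters collection around, assign .Value inside the loop, and call ExecuteNonQuery() per batch instead of rebuilding and clearing the parameters every time.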
I have to work with a big CSV file, up to 2 GB. More specifically, I have to upload all this data to a MySQL database, but before that I have to make a few calculations on it, so I need to do all of this in MATLAB (my supervisor also wants it done in MATLAB because he's only familiar with MATLAB :( ).
Any idea how I can handle these big files?
You should probably use textscan to read the data in chunks and then process each chunk. This will probably be more efficient than reading a single line at a time. For example, if you have 3 columns of data, you could do:
filename = 'fname.csv';
[fh, errMsg] = fopen( filename, 'rt' );
if fh == -1, error( 'couldn''t open file: %s: %s', filename, errMsg ); end
N = 100; % read 100 rows at a time
while ~feof( fh )
    c = textscan( fh, '%f %f %f', N, 'Delimiter', ',' );
    doStuff(c);
end
fclose( fh );
EDIT
These days (R2014b and later), it's easier and probably more efficient to use a datastore.
There is good advice on handling large datasets in MATLAB in this file exchange item.
Specific topics include:
* Understanding the maximum size of an array and the workspace in MATLAB
* Using undocumented features to show you the available memory in MATLAB
* Setting the 3GB switch under Windows XP to get 1GB more memory for MATLAB
* Using textscan to read large text files and memory mapping feature to read large binary files
I have a Perl script in which I create a table from existing MySQL databases. I have hundreds of databases, and the tables in each database contain millions of records. Because of this, a query sometimes takes hours due to indexing problems, and sometimes I run out of disk space due to improper joins. Is there a way I can kill the query from the same script by watching memory consumption and execution time?
P.S. I am using the DBI module in Perl as the MySQL interface.
As far as execution time goes, you can use Perl's alarm functionality to time out.
Your ALRM handler can either die (see the example below) or issue a DBI cancel call (sub { $sth->cancel }).
The DBI documentation actually has a very good discussion of this as well as examples:
eval {
    local $SIG{ALRM} = sub { die "TIMEOUT\n" }; # N.B. \n required
    eval {
        alarm($seconds);
        ... code to execute with timeout here (which may die) ...
    };
    # outer eval catches alarm that might fire JUST before this alarm(0)
    alarm(0); # cancel alarm (if code ran fast)
    die "$@" if $@;
};
if ( $@ eq "TIMEOUT\n" ) { ... }
elsif ($@) { ... } # some other error
As far as watching memory goes, you just need the ALRM handler to first check the memory consumption of your script, instead of simply dying/cancelling.
I won't go into the details of how to measure memory consumption, since it's an unrelated question that was likely already answered comprehensively on SO, but you can use the size() method from Proc::ProcessTable, as described in the PerlMonks snippet "Find memory usage of perl program".
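To make the control flow concrete, here is a minimal sketch of that "check memory inside the alarm handler" idea. It's in Python (signal.alarm plus getrusage) rather than Perl, just to illustrate the shape; the budget, interval, and run_long_query() placeholder are all made up, and a real handler would cancel the statement or issue KILL QUERY instead of just raising.

import resource
import signal
import time

MEMORY_LIMIT_KB = 2 * 1024 * 1024   # assumed budget (~2 GB); ru_maxrss is KB on Linux, bytes on macOS
CHECK_INTERVAL_S = 5                # re-check every 5 seconds

def on_alarm(signum, frame):
    used = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if used > MEMORY_LIMIT_KB:
        # Over budget: abort the long-running work. A Perl/DBI version would
        # call $sth->cancel or KILL QUERY here instead of raising.
        raise MemoryError("memory budget exceeded: %s" % used)
    signal.alarm(CHECK_INTERVAL_S)   # under budget: re-arm and keep watching

def run_long_query():
    time.sleep(12)                   # stand-in for the real long-running DB work

signal.signal(signal.SIGALRM, on_alarm)  # SIGALRM and resource are Unix-only
signal.alarm(CHECK_INTERVAL_S)
try:
    run_long_query()
finally:
    signal.alarm(0)                  # always cancel any pending alarm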
I used the KILL QUERY command, as described in http://www.perlmonks.org/?node_id=885620
This is the code from my script:
eval {
    eval { # Time out and interrupt work
        my $TimeOut = Sys::SigAction::set_sig_handler('ALRM', sub {
            $dbh->clone()->do("KILL QUERY " . $dbh->{"mysql_thread_id"});
            die "TIMEOUT\n";
        });
        # Set alarm
        alarm($seconds);
        $sth->execute();
        # Clear alarm
        #alarm(0);
    };
    # Prevent race condition
    alarm(0);
    die "$@" if $@;
};
This code kills the query and also removes all the temporary tables.
Watch out: you can't kill a query using the same connection handle if the query is stuck because of a table lock. You must open another connection with the same user and kill that thread id.
Of course, you have to store the list of currently open thread ids in a hash.
Mind that once you've terminated a thread id, the rest of the Perl code will be executing on an unblessed handle.
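As a rough illustration of that watchdog approach (a second connection, a stored thread id, and KILL QUERY), here is a sketch in Python; the credentials, database name, and the mysql-connector-python driver are assumptions, and any DB-API MySQL client would look much the same.

import mysql.connector  # assumed driver

# Two connections with the same user: one does the work, one acts as the watchdog.
worker = mysql.connector.connect(user="app", password="secret",
                                 host="localhost", database="mydb")
watchdog = mysql.connector.connect(user="app", password="secret", host="localhost")

# Record the worker's server-side thread id up front (the counterpart of
# $dbh->{"mysql_thread_id"} above) and keep it somewhere, e.g. a dict.
cur = worker.cursor()
cur.execute("SELECT CONNECTION_ID()")
thread_ids = {"worker": cur.fetchone()[0]}

# ... later, from the watchdog connection (say, inside a timeout or memory check),
# kill just the running statement. KILL QUERY leaves the connection open, whereas
# a plain KILL <id> would drop the whole connection.
watchdog.cursor().execute("KILL QUERY %d" % int(thread_ids["worker"]))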