How to remove duplicates in a CSV file?

I have a large file with a bunch of movie data, including a unique ID for each movie. Although every ID on each line is unique, some lines include duplicate movie data.
For example:
ID,movie_title,year
1,toy story,1995
2,jumanji,1995
[...]
6676,toy story,1995
6677,jumanji,1995
In this case, I'd like to completely remove the 6676,toy story,1995 and 6677,jumanji,1995 lines. This occurs with more than just one movie, so I can't do a simple find and replace. I've tried Sublime Text's Edit > Permute Lines > Unique feature and it works fine, but I end up losing the first column of the data (the unique IDs).
Can anyone recommend a better way to get rid of these duplicate lines?

The following Perl script does the trick. Effectively, all occurrences of a movie but the first will be deleted from the list of entries. Do not forget to add the file paths. Execute it with perl from the command line (Mac OS ships with Perl):
use IO::File;
my (
    $curline
    , $fh_in
    , $fh_out
    , $dict
    , @fields
    , $key
    , $value
);
$fh_in  = new IO::File("<...");  # add input file name
$fh_out = new IO::File(">...");  # add output file name
while (<$fh_in>) {
    chomp;
    $curline = $_;
    @fields = split( /,/, $curline );
    ($key, $value) = (join(',', @fields[1..$#fields]), $fields[0]);
    if (!exists($$dict{$key})) {
        $$dict{$key} = 1;
        $fh_out->print("$curline\n");
    }
}
$fh_out->close();
exit(0);
Explanation
The code processes the input line by line
It maintains a hash of movie identifiers seen.
Movie identifiers are defined as the line content without the id number and the immediately following comma.
A line is printed iff the movie identifier has not yet been seen.
Caveat
Evidently, this solution is not robust against spelling errors.
A certain degree of error tolerance can be added by normalizing keys. Example (case-insensitive matching):
my $key_norm; # move that out of the loop in production code
$key_norm = lc($key);
if (!exists($$dict{$key_norm})) {
    $$dict{$key_norm} = 1;
    $fh_out->print("$curline\n");
}
Neither elegance nor performance had a say in authoring this code ;-)
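If you don't need a full script, the same first-seen-wins idea fits in a Perl one-liner (a sketch; movies.csv and movies_unique.csv are placeholder names, and it assumes plain comma-separated fields with no embedded commas):
perl -F, -lane 'print unless $seen{ join ",", @F[1..$#F] }++' movies.csv > movies_unique.csv
Here -a splits each line on commas into @F, the join rebuilds the line without the ID column, and the %seen hash prints a line only the first time that key appears.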

Related

In Octave, how to load variables and values from a txt file?

First of all, thanks for helping me.
My question: let's say I have a readme.txt file, and its contents look like below
a1 3
b2 4
c3 -2.3
d23 55.6
Now, how can I make a function to load this txt file, so that in Octave I will directly have
a1=3
b2=4
c3=-2.3
d23=55.6
Allow me to say this again: DIRECTLY. Once I use this function "readFunction("readme.txt")", all those variables will be loaded and ready to use.
I tried [name, num] = textread("readme.txt", "%s %f"); "num" holds the numbers, but I don't know how to convert the cell array "name" into variable names. E.g. it's wrong if I do char(name(1)) = num(1) (trying to get a1 = 3).
Or maybe my way is just completely wrong? Thanks for the help.
First, the idea:
Read the file line by line
Replace the space with an = symbol
Optionally add a ; symbol at the end of each line
eval the string.
Second, a simple code example:
filename="file.txt";
fid=fopen(filename);
line=fgetl(fid);
while line != -1
eval(strrep(line, ' ', '='));
line=fgetl(fid);
endwhile
fclose(fid);
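A note on scope (my addition, not part of the original answer): eval creates the variables in whatever scope it runs in, so if you wrap this loop in a function like the readFunction from the question, the variables disappear when the function returns. Using evalin("caller", ...) instead of eval should assign them in the caller's workspace, which is what "DIRECTLY" asks for.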

Import text file into SQL database

I have a number of separate text files which I would like to import into an SQL database. The data is not comma-separated, so that rules out my idea of importing by comma. However, the data is spread across a number of rows. See the example text file below. Please could anyone advise how I could import specific data such as the programmed and mean values, shift number, etc.?
It looks like you have a machine-generated report. The ideal approach is to have that machine produce a different report, one that has no '/////' or any of that crap, just the data you want to import. That new report's output might look like this.
shift_num, prog_min, mean_sec, att_sec, adt_min
1, 600, 599, 658, 210
...
In practice, though, it's often not "possible" to get reports like that. (That is, it's always possible for the machine to do it, but often humans are unwilling.) When that happens, use your favorite text-processing language to turn the report into usable data.
I like awk for this kind of stuff. Others like perl.
To illustrate, I keyed in this replica of your report. (Saved as test.dat.)
ORDER Nr FG68909 Q.ty Ordered 99
...
SHIFT Nr. 1
////////
PROGRAMMED MEAN
600 min JOB TIME 599 sec
AVERAGE Turnaround Time 658 sec
AVERAGE Delivery Time 210 mins
Then I wrote this awk program. It makes a lot of assumptions about the layout of your report. Some of them will probably fail on real data.
/SHIFT/ { shift = $NF }
/JOB TIME/ {
    programmed = sprintf("%d %s", $1, $2);
    mean = sprintf("%d %s", $(NF-1), $NF);
}
/AVERAGE Turnaround/ { avg_turnaround = sprintf("%d %s", $(NF-1), $NF); }
# Assumes the line "AVERAGE Delivery" is also the end of the record.
/AVERAGE Delivery/ {
    avg_delivery = sprintf("%d %s", $(NF-1), $NF);
    printf("%d, '%s', '%s', '%s', '%s'\n", shift, programmed, mean, avg_turnaround, avg_delivery);
    # Clear the vars for the next record.
    shift = "";
    programmed = "";
    mean = "";
    avg_turnaround = "";
    avg_delivery = "";
}
The output . . .
$ awk -f test.awk test.dat
1, '600 min', '599 sec', '658 sec', '210 mins'
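From there the CSV still has to get into the database. As a sketch of that last step (not part of the original answer; the shifts table layout, the shifts.db file, and the DBD::SQLite driver are my assumptions), Perl's DBI can do the inserts:
use strict;
use warnings;
use DBI;

# Connect to a SQLite file; any DBI driver would do.
my $dbh = DBI->connect("dbi:SQLite:dbname=shifts.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do(q{CREATE TABLE IF NOT EXISTS shifts
           (shift_num INTEGER, programmed TEXT, mean TEXT,
            avg_turnaround TEXT, avg_delivery TEXT)});
my $sth = $dbh->prepare("INSERT INTO shifts VALUES (?, ?, ?, ?, ?)");

open my $fh, '<', 'test.csv' or die "cannot open test.csv: $!";
while (<$fh>) {
    chomp;
    my @fields = split /,\s*/;     # the awk output is comma separated
    s/^'|'$//g for @fields;        # strip the single quotes awk added
    $sth->execute(@fields);
}
close $fh;
$dbh->commit;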
You could write a simple application in C# to parse the contents of the file using regex, turn it into one line, and insert semicolons where required.

Why does my use of Perl's split function not split?

I'm trying to split an HTML document into its head and body:
my @contentsArray = split( /<\/head>/is, $fileContents, 1 );
if ( scalar @contentsArray == 2 ) {
    $bodyContents = $dbh->quote( trim( $contentsArray[1] ) );
    $headContents = $dbh->quote( trim( $contentsArray[0] ) . "</head>" );
}
That is what I have. $fileContents contains the HTML code. When I run this, it doesn't split. Anyone know why?
The third parameter to split is how many results to produce, so if you want to apply the expression only once, you would pass 2.
Note that this does actually limit the number of times the pattern is used to split the string (to one fewer than the number passed), not just limit the number of results returned, so this:
print join ":", split /,/, "a,b,c", 2;
outputs:
a:b,c
not:
a:b
Sorry, figured it out. I thought the 1 was how many times it would find the expression, not a limit on the results. Changed it to 2 and it works.
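For reference, here is the corrected snippet from the question (just the asker's fix, with the limit bumped from 1 to 2):
my @contentsArray = split( /<\/head>/is, $fileContents, 2 );  # limit 2: split only once
if ( scalar @contentsArray == 2 ) {
    $bodyContents = $dbh->quote( trim( $contentsArray[1] ) );
    $headContents = $dbh->quote( trim( $contentsArray[0] ) . "</head>" );
}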

Is there a clever way to parse plain-text lists into HTML?

Question: Is there a clever way to parse plain-text lists into HTML?
Or, must we resort to esoteric recursive methods, or sheer brute force?
I've been wondering this for a while now. In my own ruminations I have come back again and again to the brute-force, and odd recursive, methods ... but it always seems so clunky. There must be a better way, right?
So what's the clever way?
Assumptions
It is necessary to set up a scenario, so these are my assumptions.
Lists may be nested 3 levels deep (at a minimum), of either unordered or ordered lists. The list type and depth is controlled by its prefix:
There is a mandatory space following the prefix.
List depth is controlled by how many non-spaced characters there are in the prefix; ***** would be nested five lists deep.
List type is enforced by character type, * or - being an unordered list, # being an ordered list.
Items are separated by only 1 \n character. (Let's pretend two consecutive new-lines qualify as a "group", a paragraph, div, or some other HTML tag, like in Markdown or Textile.)
List types may be freely mixed.
Output should be valid HTML 4, preferably with ending </li>s
Parsing can be done with, or without, Regex as desired.
Sample Markup
* List
*# List
** List
**# List
** List
# List
#* List
## List
##* List
## List
Desired Output
Broken up a bit for readability, but it should be a valid variation of this (remember that I'm just spacing it nicely!):
<ul>
  <li>List</li>
  <li>
    <ol><li>list</li></ol>
    <ul><li>List</li></ul>
  </li>
  <li>List</li>
  <li>
    <ol><li>List</li></ol>
  </li>
  <li>List</li>
</ul>
<ol>
  <li>List</li>
  <li>
    <ul><li>list</li></ul>
    <ol><li>List</li></ol>
  </li>
  <li>List</li>
  <li>
    <ul><li>List</li></ul>
  </li>
  <li>List</li>
</ol>
In Summary
Just how do you do this? I'd really like to understand the good ways to handle unpredictably recursing lists, because it strikes me as an ugly mess for anyone to tangle with.
Basic iterative technique:
A regex or some other simple parser that'll recognize the format for a list, capturing each list item (including those with additional levels of indentation).
A counter to keep track of the current indentation level.
Logic to iterate through each capture, writing out <li>s and inserting appropriate begin / end tags (<ol></ol>, <ul></ul>) and incrementing / decrementing the indentation counter whenever the current indentation level is greater or less than the previous one.
Edit: Here's a simple expression that'll probably work for you with a bit of tweaking; each match is a top-level list, with two sets of named captures: the markers (char count is indentation level, last char indicates desired list type) and the list item text. A Perl sketch of the whole technique follows the expression.
(?:(?:^|\n)[\t ]*(?<marker>[*#]+)[\t ]*(?<text>[^\n\r]+)\r*(?=\n|$))+
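Putting those pieces together, here is a minimal Perl sketch of the counter/stack approach (my illustration under the assumptions above, not the answerer's code; lists_to_html and the sample input are made-up names):
use strict;
use warnings;

sub lists_to_html {
    my ($text) = @_;
    my @stack;            # currently open list tags, innermost last
    my $html = '';
    for my $line (split /\n/, $text) {
        next unless $line =~ /^([*#]+)[\t ]+(.*)$/;
        my ($marker, $item) = ($1, $2);
        my @types = map { $_ eq '#' ? 'ol' : 'ul' } split //, $marker;
        # close lists that are deeper than the current marker
        $html .= '</li></' . pop(@stack) . '>' while @stack > @types;
        # close and reopen if the list type changed at this depth
        if (@stack && @stack == @types && $stack[-1] ne $types[-1]) {
            $html .= '</li></' . pop(@stack) . '>';
        }
        # sibling item at an unchanged depth
        $html .= '</li>' if @stack == @types;
        # open new lists until we reach the current depth
        while (@stack < @types) {
            my $t = $types[scalar @stack];
            push @stack, $t;
            $html .= "<$t>";
        }
        $html .= "<li>$item";
    }
    $html .= '</li></' . pop(@stack) . '>' while @stack;
    return $html;
}

print lists_to_html("* one\n* two\n*# nested\n# new list\n"), "\n";
The stack holds the open list tags, so rising indentation pushes and emits open tags, falling indentation pops and emits close tags, and an unchanged level just closes the previous <li>.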
A line-by-line solution, with some Pythonic concepts:
cur = ''
for line in lines():
    prev = cur
    cur, text = split_line_into_marker_and_remainder(line)
    if cur && (cur == prev):
        print '</li><li>'
    else:
        nprev, ncur = kill_common_beginning(prev, cur)
        for c in nprev: print '</li>' + ((c == '#') ? '</ol>' : '</ul>')
        for c in ncur: print ((c == '#') ? '<ol>' : '<ul>') + '<li>'
    print text
This is how it works: to process the line, I compare the marker for previous line with the marker for this line.
I use a fictional function split_line_into_marker_and_remainder, which returns two results, marker cur and the text itself. It's trivial to implement it as a C++ function with 3 arguments, an input and 2 output strings.
At the core is a fictional function kill_common_beginning which would take away the repeat part of prev and cur. After that, I need to close everything that remains in previous marker and open everything that remains in current marker. I can do it with a replace, by mapping characters to string, or by a loop.
The three lines will be pretty straightforward in C++:
char *saved = prev;
for (; *prev && (*prev == *cur); prev++, cur++); // "kill_common_beginning"
while (*prev) *(prev++) == '#' ? ...
while (*cur) *(cur++) == '#' ? ...
prev = saved; // restore the saved pointer (it came from prev)
Note, however, that there is a special case: when the indentation didn't change, those lines don't output anything. That's fine if we're outside of the list, but that's not fine in the list: so in that case we should output the </li><li> manually.
The best explanation I've seen is in Higher-Order Perl by Mark Jason Dominus. The full text is available online at http://hop.perl.plover.com/book/.
Though the examples are all in Perl, the breakdown of the logic behind each area is fantastic.
Chapter 8 (warning: PDF link) is specifically about parsing, though the lessons throughout the book are somewhat related.
Look at Textile.
It is available in a number of languages.
This is how you can do it with a regexp and a cycle (^ stands for newline, $ for endline):
do {
^#anything$ -> <ol><li>$^anything</li></ol>$
^*anything$ -> <ul><li>$^anything</li></ul>$
} while any of those above applies
do {
</ol><ol> ->
</ul><ul> ->
</li><li> ->
} while any of those above applies
This makes it much simpler than a single regexp. The way it works: you first expand each line as if it were isolated, and then eat the extra list markers.
Here is my own solution, which seems to be a hybrid of Shog9's suggestions (a variation on his regex, Ruby doesn't support named matches) and Ilya's iterative method. My working language was Ruby.
Some things of note: I used a stack-based system, and "String#scan(pattern)" is really just a "match-all" method that returns an array of matches.
def list(text)
  # returns [['*','text'],...]
  parts = text.scan(/(?:(?:^|\n)([#*]+)[\t ]*(.+)(?=\n|$))/)
  # returns ul/ol based on the byte passed in
  list_type = lambda { |c| (c == '*' ? 'ul' : 'ol') }
  prev = []
  tags = [list_type.call(parts[0][0][0].chr)]
  result = parts.inject("<#{tags.last}><li>") do |output, newline|
    unless prev.count == 0
      # the following comparison says whether added or removed,
      # this is the "how much"
      diff = (prev[0].length - newline[0].length).abs
      case prev[0].length <=> newline[0].length
      when -1 then # new tags to add
        part = newline[0].slice(-diff, diff) # the newly added marker chars
        part.each_char do |c|
          tags << list_type.call(c)
          output << "<#{tags.last}><li>"
        end
      when 0 then # no new tags... but possibly changed
        if newline[0] == prev[0]
          output << '</li><li>'
        else
          STDERR.puts "Bad input string: #{newline.join(' ')}"
        end
      when 1 then # tags removed
        diff.times { output << "</li></#{tags.pop}>" }
        output << '</li><li>'
      end
    end
    prev = newline
    output + newline[1]
  end
  tags.reverse.each { |t| result << "</li></#{t}>" }
  result
end
Thankfully this code does work and generates valid HTML. And this did turn out better than I had anticipated. It doesn't even feel clunky.
This Perl program is a first attempt at that.
#! /usr/bin/env perl
use strict;
use warnings;
use 5.010;

my $data = [];

while ( my $line = <> ) {
    last if $line =~ /^[.]{3,3}$/;
    my ( $nest, $rest ) = $line =~ /^([\#*]*)\s+(.*)$/x;
    my @nest = split '', $nest;
    if (@nest) {
        recourse( $data, $rest, @nest );
    } else {
        push @$data, $line;
    }
}

de_recourse($data);

sub de_recourse {
    my ($ref) = @_;
    my %de_map = (
        '*' => 'ul',
        '#' => 'ol'
    );
    if ( ref $ref ) {
        my ( $type, @elem ) = @$ref;
        if ( ref $type ) {
            for my $elem (@$ref) {
                de_recourse($elem);
            }
        } else {
            $type = $de_map{$type};
            say "<$type>";
            for my $elem (@elem) {
                say "<li>";
                de_recourse($elem);
                say "</li>";
            }
            say "</$type>";
        }
    } else {
        print $ref;
    }
}

sub recourse {
    my ( $last_ref, $str, @nest ) = @_;
    die unless @_ >= 2;
    die unless ref $last_ref;
    my $nest = shift @nest;

    if ( @_ == 2 ) {
        push @$last_ref, $str;
        return;
    }

    my $previous = $last_ref->[-1];
    if ( ref $previous ) {
        if ( $previous->[0] eq $nest ) {
            recourse( $previous, $str, @nest );
            return;
        }
    }

    my $new_ref = [$nest];
    push @$last_ref, $new_ref;
    recourse( $new_ref, $str, @nest );
}
Hope it helps
Try Gelatin. The syntax definition would probably be 5 lines or less.

Best way to find illegal characters in a bunch of ISO-8859-1 web pages?

I have a bunch of HTML files in a site that were created in the year 2000 and have been maintained to this day. We've recently begun an effort to replace illegal characters with their HTML entities. Going page to page looking for copyright symbols and trademark tags seems like quite a chore. Do any of you know of an app that will take a bunch of HTML files and tell me where I need to replace illegal characters with HTML entities?
You could write a PHP script (if you can; if not, I'd be happy to help), but I assume you already converted some of the "special characters", so that does make the task a little harder (although I still think it's possible)...
Any good text editor will do a file contents search for you and return a list of matches.
I do this with EditPlus. There are several editors like Notepad++, TextPad, etc that will easily help you do this.
You do not have to open the files. You just specify the path where the files are stored, the mask (*.html), and the content to search for ("©"), and the editor will come back with a list of matches; double-clicking one opens the file and brings up the matching line.
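If a command-line pass is acceptable, a small Perl filter can do the same job (a minimal sketch of my own, not from either answer; the character whitelist below is an assumption about what counts as "legal"):
use strict;
use warnings;

# Usage: perl find_non_ascii.pl *.html
while (my $line = <>) {
    # report every byte outside tab/newline/CR and printable ASCII
    while ($line =~ /([^\x09\x0A\x0D\x20-\x7E])/g) {
        printf "%s:%d: byte 0x%02X\n", $ARGV, $., ord($1);
    }
} continue {
    close ARGV if eof;   # reset the line counter $. for each file
}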
I also have a website that needs to regularly convert large numbers of file names back and forth between character sets. While a text editor can do this, a portable two-step solution in PHP was preferable: first add the filenames to an array, then do the search and replace. An extra piece of code in the function excludes certain file types from the array.
function listdir($start_dir = '.') {
    $nonFilesArray = array('index.php', 'index.html', 'help.html'); // unallowed files & subfolders
    $filesArray = array(); // $filesArray holds new records and $full[$j] holds names
    if (is_dir($start_dir)) {
        $fh = opendir($start_dir);
        while (($tmpFile = readdir($fh)) !== false) { // get each filename without its path
            if (strcmp($tmpFile, '.') == 0 || strcmp($tmpFile, '..') == 0) continue; // skip . & ..
            $filepath = $start_dir . '/' . $tmpFile; // name the relative path/to/file
            if (is_dir($filepath)) { // if path/to/file is a folder, recurse into it
                $filesArray = array_merge($filesArray, listdir($filepath));
            } else { // add $filepath to the end of the array
                $test = 1;
                foreach ($nonFilesArray as $nonfile) {
                    if ($tmpFile == $nonfile) { $test = 0; break; }
                }
                if ($test == 1 && pathinfo($tmpFile, PATHINFO_EXTENSION) == 'html') {
                    $filepath = substr_replace($filepath, '', 0, 17); // strip initial part of $filepath
                    $filesArray[] = $filepath;
                }
            }
        }
        closedir($fh);
    } else {
        $filesArray = false; # no such folder
    }
    return $filesArray;
}
$filesArray = listdir($targetdir); // call the function for this directory
$numNewFiles = count($filesArray); // get number of records
for ($i = 0; $i < $numNewFiles; $i++) { // read the filenames and replace unwanted characters
    $tmplnk = $linkpath . $filesArray[$i];
    $outname = basename($filesArray[$i], ".html");
    $outname = str_replace('-', ' ', $outname);
}