I want to delete the lines from file 1.txt that also appear in file 2.txt, and save the output to 3.txt.
I am using this bash command:
comm -23 1.txt 2.txt > 3.txt
When I check the output in 3.txt, I find that some lines common to 1.txt and 2.txt are still there; take the word "registry" as an example. What is the problem?
You can download the two files below:
file 1.txt : https://ufile.io/n7vn6
file 2.txt : https://ufile.io/p4s58
comm needs the input to be sorted. You can use process substitution for that:
comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt
Update: if you additionally have a problem with line endings, you can use sed to strip the carriage returns first:
comm -23 <(sed 's/\r//g' 1.txt | sort) <(sed 's/\r//g' 2.txt | sort) > 3.txt
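If you want to confirm that unsorted input is the culprit, sort -c checks order without writing anything (a quick diagnostic sketch):
# sort -c exits non-zero and names the first out-of-order line, if any
sort -c 1.txt || echo "1.txt is not sorted"
sort -c 2.txt || echo "2.txt is not sorted"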
I'm not sure how you generated your text files, but the problem is that the line endings in 1.txt and 2.txt are not consistent. Some lines end with a CR character (Ctrl-M) in front of the line feed, rather than the bare line feed Linux expects in text files. For example, one of them has registry^M, which doesn't match registry: Linux programs that examine text see the ^M as just another character, not as part of the line termination to be ignored. In some text editors the ^M isn't visible, so registry appears to be the same in both places, but it isn't.
You could try:
dos2unix 1.txt 2.txt
comm -23 <(sort 1.txt) <(sort 2.txt) > 3.txt
dos2unix normalizes the line terminations (converting any DOS CRLF endings to plain LF). Note that removing the CRs can change the sort order slightly, so I'm also re-sorting the files. You can try this without re-sorting, and if there's an issue comm will give an error that one of the files isn't sorted.
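If you want to see the stray carriage returns before converting anything, something like this works (a sketch; -A is the GNU cat spelling, use -e on BSD):
# make line endings visible: CRLF lines show up ending in ^M$
cat -A 1.txt | head
# count how many lines still carry a carriage return
grep -c $'\r' 1.txt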
I have a CSV file where I'm trying to replace two carriage returns in a row with a single carriage return using Fart.exe. First off, is this possible? If so, the text within the CSV is laid out like the below, where "CRLF" is an actual carriage return/line feed pair.
,CRLF
CRLF
But I want it to be just this without the extra carriage return on the second line:
,CRLF
I thought I could just do the below but it won't work:
CALL "C:\tmp\fart.exe" -C "C:\tmp\myfile.csv" ,\r\n\r\n ,\r\n
I need to know what to change ,\r\n\r\n to in order to make this work. Any ideas how I could make this happen? Thanks!
As Squashman has suggested, you are simply trying to remove empty lines.
There is no need for a 3rd party tool to do this. You can simply use FINDSTR to discard empty lines:
findstr /v "^$" myFile.txt >myFile.txt.new
move /y myFile.txt.new *. >nul
However, this will only work if all the lines end with CRLF. If you have a Unix-formatted file where each line ends with a bare LF, then it will not work.
A more robust option would be to use JREPL.BAT - a regular expression command line text processing utility.
jrepl "^$" "" /r 0 /f myFile.txt /o -
Be sure to use CALL JREPL if you put the command within a batch script.
FART processes one line at a time, and the CRLF is not considered to be part of the line. So you can't use a normal FART command to remove CRLF. If you really want to use FART, then you will need to use the -B binary mode. You also need to use -C to get support for the escape sequences.
I've never used FART, so I can't be sure, but I believe the following would work:
call fart -B -C myFile.txt "\r\n\r\n" "\r\n"
If you have many consecutive empty lines, then you will need to run the FART command repeatedly until there are no more changes.
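For what it's worth, if a Unix-style sed happens to be available on the machine, collapsing any number of consecutive empty lines takes a single pass, because deleting every empty line (with or without the trailing CR) is one expression (a sketch, assuming GNU sed):
# delete lines that are empty apart from an optional carriage return
sed '/^\r\?$/d' myfile.csv > myfile.fixed.csv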
I have a .csv containing a few columns. One of those columns needs to be updated to the same number in ~1000 files. I'm trying to use AWK to edit each file, but I'm not getting the intended result.
What the original .csv looks like
heading_1,heading_2,heading_3,heading_4
a,b,c,1
d,e,f,1
g,h,i,1
j,k,m,1
I'm trying to update column 4 from 1 to 15.
awk '$4="15"' FS=, OFS=, file > update.csv
When I run this on a .csv generated in Excel, the result is a stray ^M (carriage return) character after the first line (which it does update to 15), and then it terminates without updating any of the other rows.
It repeats the same mistake on each file when running through all files in a directory.
for file in *.csv; do awk '$4="15"' FS=, OFS=, "$file" > "${file}_updated.csv"; done
Alternatively, if someone has a better way to do this task, I'm open to suggestions.
Excel is generating the control-Ms, not awk. Run dos2unix or similar on your file before running awk on it.
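With ~1000 files, that conversion can be folded into your loop; a sketch (the _updated.csv naming is just carried over from your command):
for file in *.csv; do
  dos2unix "$file"                                  # strip the CR characters in place
  awk 'BEGIN{FS=OFS=","} {$4=15} 1' "$file" > "${file}_updated.csv"
done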
Well, I couldn't reproduce your problem on my Linux box, since writing 15 to the last column overwrites the \r (the ^M is actually 0x0D, i.e. \r) before the newline \n, but you could always remove the \r first:
$ awk 'sub(/\r/,""); ...' file
I have had issues with non-ASCII characters when a file is processed under a different locale, for example a file with ISO-8859-1 encoding processed with GNU awk in a UTF-8 shell.
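Putting that together with your command, a one-pass sketch that strips the CR and then rewrites the field:
# drop a trailing CR if present, overwrite field 4, print every record
awk 'BEGIN{FS=OFS=","} {sub(/\r$/,""); $4=15} 1' file > update.csv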
I have two CSV files:
hogehoge.csv
1,aaa,bbb
2,ccc,ddd
3,eee,fff
4,ggg,hhh
5,iii,jjj
6,kkk,lll
7,mmm,nnn
8,ooo,ppp
9,qqq,rrr
10,sss,ttt
hogehoge2.csv
1,aaa,bb
2,ccc,ddd
3,eee,fff
4,ggg,hhh
5,iii,jjj
7,mmm,nnn
8,ooo,ppp
9,qqq,rrr
10,sss,ttt
I want to get a result like this by command line (diff/cut/awk).
6,kkk,lll
There is also a difference on the 1st line, but I want to ignore that one.
As the question is stated, you simply want to compare two files line-by-line. comm may be a good choice:
comm -3 hogehoge.csv hogehoge2.csv
If you want to ignore the first line of each file:
comm -3 <(tail -n +2 hogehoge.csv) <(tail -n +2 hogehoge2.csv)
which will print exactly the output you specified. Note: comm -3 prints the lines that are unique to each file, and the lines that only appear in the second file are indented with a tab. Also note that comm expects lexically sorted input; GNU comm will warn that these files are out of order (lexically, 10 sorts before 2), but since both files share the same ordering the comparison still works, and --nocheck-order silences the warning. To remove the tabs:
comm -3 <(tail -n +2 hogehoge.csv) <(tail -n +2 hogehoge2.csv) | sed $'s/\t*//'
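Since you mentioned awk: an alternative that doesn't depend on sort order is to record every line of hogehoge2.csv and then print whatever in hogehoge.csv (past its first line) was never recorded (a sketch):
# the first file fills the seen[] array; for the second file, FNR>1 skips
# the first line and only unrecorded lines are printed
awk 'NR==FNR { seen[$0]=1; next } FNR>1 && !($0 in seen)' hogehoge2.csv hogehoge.csv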
Can aspell output the line number instead of the offset in pipe mode for HTML and XML files? I can't read the file line by line, because then aspell can't identify a closing tag if the tag sits on the next line.
This will output all occurrences of misspelt words with line numbers:
# Get aspell output...
<my_document.txt aspell pipe list -d en_GB --personal=./aspell.ignore.txt |
# Process the aspell output...
grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' -oh | \
grep '[a-zA-Z]\+' -o | \
while read word; do grep -on "\<$word\>" my_document.txt; done
Where:
my_document.txt is your original document
en_GB is your primary dictionary choice (e.g. try en_US)
aspell.ignore.txt is an aspell personal dictionary (example below)
aspell_output.txt is the output of aspell in pipe mode (ispell style)
results.txt is the final results file
aspell.ignore.txt example:
personal_ws-1.1 en 500
foo
bar
example results.txt output (for an en_GB dictionary):
238:color
302:writeable
355:backends
433:dataonly
You can also print the whole line by changing the last grep -on into grep -n.
This is just an idea; I haven't really tried it yet (I'm on a Windows machine :(). But maybe you could pipe the HTML file through head (with a byte limit) and count the newlines using grep to find your line number. It's neither efficient nor pretty, but it might just work.
cat icantspell.html | head -c <offset from aspell> | egrep -Uc "$"
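If you want to try that idea, wc -l is a slightly cleaner way to count the newlines (a sketch; 1234 stands in for a byte offset actually reported by aspell):
offset=1234   # hypothetical byte offset taken from aspell's output
# newlines in the first $offset bytes = 0-based line number, so add 1
head -c "$offset" icantspell.html | wc -l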
I use the following script to perform spell-checking and to work around the awkward output of aspell -a / ispell. At the same time, the script also works around the problem that ordinals like 2nd aren't recognized by aspell, by simply ignoring everything aspell reports that is not a word on its own.
#!/bin/bash
# pipefail would break the negated pipeline below, so make sure it is off
set +o pipefail
# use colored grep output only when stdout is a terminal
if [ -t 1 ] ; then
  color="--color=always"
fi
# the leading ! inverts the final grep: exit 1 if any mistake was found
! for file in "$@" ; do
  <"$file" aspell pipe list -p ./dict --mode=html |
  # keep the "word count offset" report lines, then extract just the word
  grep '[[:alpha:]]\+ [0-9]\+ [0-9]\+' -oh |
  grep '[[:alpha:]]\+' -o |
  while read word ; do
    grep $color -n "\<$word\>" "$file"
  done
done | grep .
You even get colored output if the stdout of the script is a terminal, and you get an exit status of 1 in case the script found spelling mistakes, otherwise the exit status of the script is 0.
Also, the script protects itself from pipefail, which is a somewhat popular option to set, e.g. in a Makefile, but doesn't work for this script. Last but not least, this script explicitly uses [[:alpha:]] instead of [a-zA-Z], which is less confusing when it also matches non-ASCII characters like the German äöüÄÖÜß. [a-zA-Z] can match those as well, depending on the locale, but there it tends to come as a surprise.
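Assuming you save the script as, say, spellcheck (the name is hypothetical), a typical invocation looks like this:
# the script exits 1 when at least one spelling mistake was found
./spellcheck *.html || echo "spelling mistakes found"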
aspell pipe / aspell -a / ispell output one empty line for each input line (after reporting the errors of the line).
Demonstration printing the line number with awk:
$ aspell pipe < testFile.txt |
awk '/^$/ { countedLine=countedLine+1; print "#L=" countedLine; next; } //'
produces this output:
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
& iinternational 7 0: international, Internationale, internationally, internationals, intentional, international's, Internationale's
#L=1
*
*
*
& reelly 22 11: Reilly, really, reel, rely, rally, relay, resell, retell, Riley, rel, regally, Riel, freely, real, rill, roll, reels, reply, Greeley, cruelly, reel's, Reilly's
#L=2
*
#L=3
*
*
& sometypo 18 8: some typo, some-typo, setup, sometime, someday, smote, meetup, smarty, stupor, Smetana, somatic, symmetry, mistype, smutty, smite, Sumter, smut, steppe
#L=4
with testFile.txt
iinternational
I say this reelly.
hello
here is sometypo.
(Still not as nice as hunspell -u (https://stackoverflow.com/a/10778071/4124767). But hunspell misses some command line options I like.)
For others using aspell with one of the filter modes (tex, html, etc), here's a way to only print line numbers for misspelled words in the filtered text. So for example, it won't print misspellings in the comments.
ASPELL_ARGS="--mode=html --personal=./.aspell.en.pws"
for file in "$#"; do
for word in $(aspell $ASPELL_ARGS list < "$file" | sort -u); do
grep -no "\<$word\>" <(aspell $ASPELL_ARGS filter < "$file")
done | sort -n
done
This works because aspell filter does not delete empty lines. I realize this isn't using aspell pipe as requested by OP, but it's in the same spirit of making aspell print line numbers.
I'm trying to run a split on a file where the filename has spaces in it.
I can't seem to get it to work. So I have the following:
SOURCE_FILE="test file.txt"
split -l 100 $SOURCE_FILE
Now I've tried enclosing $SOURCE_FILE in literal " characters, with no luck:
split -l 100 "\""$SOURCE_FILE"\""
or even
split -l 100 '"'$SOURCE_FILE'"'
I'm still getting:
usage: split [-l line_count] [-a suffix_length] [file [name]]
or: split -b number[k|m] [-a suffix_length] [file [name]]
You're trying too hard! A single set of double quotes will suffice:
split -l 100 "$SOURCE_FILE"
You want the arguments to split to look like this:
-l
100
test file.txt
The commands you were trying both yield these arguments:
-l
100
"test
file.txt"
As in, they are equivalent to this incorrect command:
split -l 100 '"test' 'file.txt"'
Or you could insert a backslash to escape the embedded space:
SOURCE_FILE=test\ file.txt
split -l 100 "$SOURCE_FILE"
I assume you tried just "$SOURCE_FILE" without the fancy escaping tricks?
I think I would try cat-ing the file into split; maybe split just has issues with spaces in file names, or maybe it's really pissed off about something other than the space.
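If you do want to try feeding split through a pipe, it reads standard input when given - as the file operand (a sketch; part- is just a made-up output prefix):
# split reads stdin here; output files will be named part-aa, part-ab, ...
cat "$SOURCE_FILE" | split -l 100 - part-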