I have two notepads and each notepad contains some data. Let's say Notepad 1 and Notepad 2
Notepad 1 contains: A, B, C
Notepad 2 contains: C, D, E
How can I find the data in Notepad 2 that also appears in Notepad 1? Here the answer is C. But I have lots of data in both notepads, so it is not feasible to take each entry from Notepad 1 and press Ctrl+F in Notepad 2 to find it. Is there a suitable method for this? Would it be possible by converting these notepads into HTML pages?
This can be done with the comm(1) tool:
$ cat F1
A
B
C
$ cat F2
C
D
E
$ comm -12 F1 F2
C
$
The -1 suppresses all lines unique to the first file, and -2 suppresses all lines unique to the second file. All that is left are the lines common to both.
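One caveat: comm expects both input files to be sorted. If yours are not, you can sort them on the fly with bash process substitution (file names here are just placeholders for your two notepads):

```shell
# The two "notepads" from the question, deliberately out of order
printf 'B\nA\nC\n' > notepad1.txt
printf 'E\nC\nD\n' > notepad2.txt

# comm requires sorted input, so sort each file on the fly (bash process substitution)
comm -12 <(sort notepad1.txt) <(sort notepad2.txt)   # prints: C
```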
Probably you'd like to have a look at diff/merge tools. WinMerge is a free one; Araxis Merge is another good option, though commercial. You can also just use the Notepad++ editor with its Compare plugin. These tools are GUI based and can help if you want to see and edit the differences.
If you need to extract and automatically process the differences, you will most likely have to use console tools and scripting. The *nix diff command can be used to extract differences, and there are many scripting languages suitable for text processing: sed, AWK, Perl, and Python, for instance.
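If a GUI isn't required, finding the lines common to both files can also be done in a single awk pass, with no sorting needed (file1.txt and file2.txt stand in for the two notepads here):

```shell
# The two files from the question
printf 'A\nB\nC\n' > file1.txt
printf 'C\nD\nE\n' > file2.txt

# First pass (NR==FNR) records every line of file1 in an array; the second
# pass prints the lines of file2 that were seen in file1. No sorting required.
awk 'NR==FNR { seen[$0]; next } $0 in seen' file1.txt file2.txt   # prints: C
```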
I have a phased .vcf file generated by Longshot from a MinION sequencing run of diploid human DNA. I would like to be able to split the file into two haploid files: one for haplotype 1, one for haplotype 2.
Do any of the VCF toolkits provide this function out of the box?
3 variants from my file:
##fileformat=VCFv4.2
##source=Longshot v0.4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth of reads passing MAPQ filter">
##INFO=<ID=AC,Number=R,Type=Integer,Description="Number of Observations of Each Allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##FORMAT=<ID=UG,Number=1,Type=String,Description="Unphased Genotype (pre-haplotype-assembly)">
##FORMAT=<ID=UQ,Number=1,Type=Float,Description="Unphased Genotype Quality (pre-haplotype-assembly)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr1 161499264 . G C 500.00 PASS DP=55;AC=27,27 GT:GQ:PS:UG:UQ 0|1:500.00:161499264:0/1:147.24
chr1 161502368 . A G 500.00 PASS DP=43;AC=4,38 GT:GQ:PS:UG:UQ 1/1:342.00:.:1/1:44.91
chr1 161504083 . A C 346.17 PASS DP=39;AC=19,17 GT:GQ:PS:UG:UQ 1|0:346.17:161499264:0/1:147.24
To extract haplotypes from phased VCF files, you can use samplereplay from RTG Tools to generate the haplotype SDF file; then sdf2sam, sdf2fasta, and sdf2fastq to obtain corresponding files of the phased haplotypes.
Edit: I hadn't noticed that you needed a haploid VCF file. The method above should still work if you first convert to SAM and then back to a VCF.
I didn't find a tool, so I coded something (not pretty, but it works):
awk '{ if ($1 ~ /^##/) print; \
  else if ($1 == "#CHROM") { ORS="\t"; for (i=1; i<10; i++) print $i; \
    for (i=10; i<NF; i++) print $i"_A\t"$i"_B"; \
    ORS="\n"; print $NF"_A\t"$NF"_B" } \
  else { ORS="\t"; for (i=1; i<10; i++) print $i; \
    for (i=10; i<NF; i++) print substr($i,1,1)"\t"substr($i,3,1); \
    ORS="\n"; print substr($NF,1,1)"\t"substr($NF,3,1) } }' VCF_FILE
The first branch prints the ## header lines unchanged.
In the #CHROM branch, the name of each individual is duplicated (as NAME_A and NAME_B, but you can change the suffixes).
In the last branch, only the GT characters are kept, using substr().
If you want to keep the other fields you can use substr() as well.
For example, substr($i,1,1)substr($i,4,100) keeps the GT of the first haplotype together with the remaining fields.
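If the awk feels opaque, the same splitting logic can be sketched in Python. This is an illustrative re-implementation of the idea above (the sample variant line in the demo is made up), not the author's script:

```python
def split_haplotypes(line):
    """Split each phased sample field 'a|b:...' into two haploid columns a and b."""
    if line.startswith("##"):          # meta-information lines pass through unchanged
        return line
    fields = line.split("\t")
    out = fields[:9]                   # the 9 fixed VCF columns stay as they are
    if fields[0] == "#CHROM":
        for name in fields[9:]:        # duplicate each sample name
            out += [name + "_A", name + "_B"]
    else:
        for sample in fields[9:]:      # keep the characters on either side of '|' (or '/')
            out += [sample[0], sample[2]]
    return "\t".join(out)

# A made-up single-sample variant line
print(split_haplotypes("chr1\t100\t.\tG\tC\t500\tPASS\tDP=55\tGT\t0|1:500.00"))
# -> chr1  100  .  G  C  500  PASS  DP=55  GT  0  1  (tab-separated)
```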
I have a CSV like the one below. Some columns contain a line break, like column B. When I run wc -l file.csv, Unix returns 4, but these are actually 3 records. I don't want to replace the line break with a space: I am going to load the data into a database using SQL*Loader and want to load it as it is. What should I do so that Unix counts a record containing a line break as one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like Perl and Python have CSV parsing libraries available, and packages like csvkit provide useful utilities.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
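The same count can be reproduced with nothing but Python's standard csv module, which treats a quoted embedded newline as part of a field rather than a record separator:

```python
import csv

# Recreate the question's file: 4 physical lines, 3 logical records
data = 'A,B,C,D\n1,"hello\nworld",sds,sds\n2,sdsd,sdds,sdds\n'
with open("foo.csv", "w", newline="") as f:
    f.write(data)

# csv.reader keeps the quoted newline inside one field instead of splitting the record
with open("foo.csv", newline="") as f:
    rows = list(csv.reader(f))

print(len(rows))      # 3 records, even though wc -l reports 4 lines
print(rows[1][1])     # the field keeps its embedded newline: 'hello\nworld'
```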
How do I check for duplicates across two files (file1.txt & file2.txt)?
There are lines in file1.txt that also appear in file2.txt, and I want those duplicates removed from file1.txt, so that file1.txt is left with only the non-duplicates.
Can I use Notepad++ (or its plugins) or any other software to do this?
I hope I was clear; I guess there is an easy solution for this, it's just that my brain isn't working right now.
You could create a third file and use the TextFX plugin to sort the lines and keep only the unique ones
(credit for the TextFX usage goes to https://www.cathrinewilhelmsen.net/2012/05/16/notepad-remove-duplicates-remove-blank-lines-sort-data/)
To combine the 2 text files into one you can run from cmd:
type file1.txt > file3.txt
type file2.txt >> file3.txt (note the double >>)
Then open file3.txt in Notepad++ and apply the TextFX steps above.
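If a command line is acceptable, the exact result asked for (file1 minus every line that also appears in file2, no third file needed) is a grep one-liner:

```shell
# file1.txt holds the working list, file2.txt the lines to remove
printf 'A\nB\nC\n' > file1.txt
printf 'C\nD\nE\n' > file2.txt

# -F fixed strings, -x whole-line matches, -v invert, -f read patterns from file2
grep -Fxv -f file2.txt file1.txt > file1_unique.txt
cat file1_unique.txt   # A and B: the lines of file1 not present in file2
```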
I'm trying to convert an HTML file to a PDF using the Mac terminal.
I found a similar post and used the code it provided, but I kept getting nothing: I could not find the output file anywhere after issuing this command:
./soffice --headless --convert-to pdf --outdir /home/user ~/Downloads/*.odt
I'm using Mac OS X 10.8.5.
Can someone show me a terminal command line that I can use to convert HTML to PDF?
OK, here is an alternative way to convert (X)HTML to PDF on the Mac command line. It does not use LibreOffice at all and should work on all Macs.
This method (ab)uses a filter from the Mac's print subsystem, called xhtmltopdf. This filter is usually not meant to be used by end-users but only by the CUPS printing system.
However, if you know about it, know where to find it and know how to run it, there is no problem with doing so:
The first thing to know is that it is not in any desktop user's $PATH. It is in /usr/libexec/cups/filter/xhtmltopdf.
The second thing to know is that it requires a specific syntax and order of parameters to run, otherwise it won't. Called with no parameters at all (or with the wrong number of parameters), it emits a small usage hint:
$ /usr/libexec/cups/filter/xhtmltopdf
Usage: xhtmltopdf job-id user title copies options [file]
Most of these parameter names show that the tool is clearly related to printing. The command requires at least 5 parameters in total, plus an optional 6th. If only 5 parameters are given, it reads its input from <stdin>; otherwise it reads from the 6th parameter, a file name. It always emits its output to <stdout>.
The only CLI params which are interesting to us are number 5 (the "options") and the (optional) number 6 (the input file name).
When we run it on the command line, we have to supply 5 dummy or empty parameters first, before we can put the input file's name. We also have to redirect the output to a PDF file.
So, let's try it:
/usr/libexec/cups/filter/xhtmltopdf "" "" "" "" "" my.html > my.pdf
Or, alternatively (this is faster to type and easier to check for completeness, using 5 dummy parameters instead of 5 empty ones):
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html > my.pdf
While we are at it, we could try to apply some other CUPS print subsystem filters to the output: /usr/libexec/cups/filter/cgpdftopdf looks like one that could be interesting. This additional filter expects the same number and order of parameters, like all CUPS filters.
So this should work:
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html \
| /usr/libexec/cups/filter/cgpdftopdf 1 2 3 4 "" \
> my.pdf
However, piping the output of xhtmltopdf into cgpdftopdf is only interesting if we try to apply some "print options". That is, we need to come up with some settings in parameter no. 5 which achieve something.
Looking up the CUPS command line options on the CUPS web page suggests a few candidates:
-o number-up=4
-o page-border=double-thick
-o number-up-layout=tblr
do look like they could be applied while doing a PDF-to-PDF transformation. Let's try:
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html \
| /usr/libexec/cups/filter/cgpdftopdf 1 2 3 4 5 \
"number-up=4 page-border=double-thick number-up-layout=tblr" \
> my.pdf
Here are two screenshots of results I achieved with this method. Both used as input two HTML files that were identical apart from one line: the line referencing the CSS file used for rendering the HTML.
As you can see, the xhtmltopdf filter is able to (at least partially) take into account CSS settings when it converts its input to PDF:
Starting with 3.6.0.1, you would need unoconv on the system to convert documents.
Using unoconv with MacOS X
LibreOffice 3.6.0.1 or later is required to use unoconv under MacOS X. This is the first version distributed with an internal python script that works. No version of OpenOffice for MacOS X (3.4 is the current version) works because the necessary internal files are not included inside the application.
I just had the same problem, but I found this LibreOffice help post. It seems that headless mode won't work if you've got LibreOffice (the usual GUI version) running too. The fix is to add an -env option, e.g.
libreoffice "-env:UserInstallation=file:///tmp/LibO_Conversion" \
--headless \
--invisible \
--convert-to csv file.xls
I am currently using a Windows utility called TableTexCompare.
This tool can take 2 CSV files and compare them. The nice thing about it is that it can make the comparison even if the records of the 2 files are not sorted in the same order or the fields are not positioned in the same order.
As such, the following 2 files would result in a successful comparison
(File1.csv)
FirstName,LastName,Age
Mona,Sax,30
Max,Payne,43
Jack,Lupino,50
(File2.csv)
FirstName,Age,LastName
Max,43,Payne
Jack,50,Lupino
Mona,30,Sax
What I am looking for is to do the same thing from the command-line with just 1 difference:
I would like the comparison to be performed in one direction only, i.e. if File2.csv is as follows (a subset of File1.csv), the comparison should pass
(File2.csv)
FirstName,Age,LastName
Jack,50,Lupino
I do not particularly care whether it is done in a programming language, a dedicated CLI tool, or a shell script (e.g. using awk). I have some experience with Java and Groovy, but would like to be pointed in some initial direction.
I can offer a Python solution:
import csv

with open("file1.csv") as f1, open("file2.csv") as f2:
    r1 = list(csv.DictReader(f1))
    r2 = csv.DictReader(f2)
    for item in r2:
        if item not in r1:
            print("r2 is not a subset of r1!")
            break
This is actually a bit more verbose than necessary in Python (but easier to understand); I personally would have used a generator expression:
import csv

with open("file1.csv") as f1, open("file2.csv") as f2:
    r1 = list(csv.DictReader(f1))
    r2 = csv.DictReader(f2)
    if all(item in r1 for item in r2):
        print("r2 is a subset of r1")
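For a quick check of the logic, here is a self-contained run against the question's sample data (inlined via io.StringIO so no files are needed). DictReader keys every row by its header name, which is exactly why the differing column order between the two files does not matter:

```python
import csv
import io

# The question's File1.csv and the reduced File2.csv, with columns in a different order
file1 = "FirstName,LastName,Age\nMona,Sax,30\nMax,Payne,43\nJack,Lupino,50\n"
file2 = "FirstName,Age,LastName\nJack,50,Lupino\n"

# Each row becomes a dict keyed by header name, so column order is irrelevant
r1 = list(csv.DictReader(io.StringIO(file1)))
r2 = list(csv.DictReader(io.StringIO(file2)))

is_subset = all(item in r1 for item in r2)
print(is_subset)   # True: every File2 row also exists in File1
```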
If you can afford to do a case insensitive comparison, and if there are no duplicates within File2.csv that must be matched within File1.csv, and if File1.csv does not contain \\ or \", then all you need is a simple FINDSTR command.
The following will list lines in File2.csv that do not appear in File1.csv:
findstr /vxig:"File1.csv" "File2.csv"
If all you want is an indication whether File1.csv is a superset of File2.csv, then
findstr /vxig:"File1.csv" "File2.csv" >nul && (echo File1 is NOT a superset of File2) || (echo File1 IS a superset of File2)
The search should not have to be case insensitive, except there is a nasty FINDSTR bug: it may fail to find matches when there are multiple case sensitive literal search strings of varying size. The case insensitive option avoids the bug. See Why doesn't this FINDSTR example with multiple literal search strings find a match? for more info.
The search will not work properly if File2.csv contains \\ or \" because FINDSTR will treat them as \ and " respectively. See What are the undocumented features and limitations of the Windows FINDSTR command? for more info. The accepted answer has sections describing FINDSTR escape sequences about half way down.
You can take a look at q - Text as a Database, which allows executing SQL directly on CSV files, including joins. This makes a comparison easy, and much more besides, such as matching specific columns for equality, or getting specific columns from rows that don't match.
Full disclosure - It's my own open source tool.
Harel Ben-Attia