Search and replace a CSV in column mode in vim

I couldn't find a reasonable answer to my specific question. I know how to do search and replace with vim/sed, but how do we deal with a CSV in vim when the search and replace should operate in column mode? For example, say we have a chunk of CSV data like:
automotive_bitcount,1,1,0,1,0,0,0,0,1,0,0,1,1,0,1,0,1,1,1,1,0,0,1,0,0
automotive_bitcount,1,0,0,1,0,1,0,1,1,1,0,0,0,0,1,1,1,0,1,0,1,0,0,1,0
automotive_bitcount,2,1,0,0,0,0,1,0,1,0,1,1,0,0,0,0,0,0,1,1,1,1,0,1,0
automotive_bitcount,2,0,0,0,1,1,0,0,1,0,0,0,1,1,1,0,0,0,0,1,1,1,1,0,0
automotive_bitcount,2,1,0,0,0,1,1,0,1,0,1,1,0,0,1,1,1,0,1,1,1,0,1,1,1
which corresponds to the header:
APP_NAME, DATASET, COMPILER FLAG#1, ..., COMPILER FLAG#24
Here is the search-and-replace task: I would like to replace each "1" in the columns with the corresponding compiler flag (listed below), so that in the end I get something like the following structure to pass to the compiler:
automotive_bitcount dataset1 -fno-guess-branch-probability -fno-ivopts -fno-tree-loop-optimize -fno-inline-functions -fno-omit-frame-pointer -falign-jumps -fselective-scheduling -fno-tree-pre -fno-move-loop-invariants
Just for the record, the 24 compiler flags are as follows (in order):
compilerOptionList= "-funsafe-math-optimizations -fno-guess-branch-probability -fno-ivopts -fno-tree-loop-optimize -fno-inline-functions -funroll-all-loops -fno-omit-frame-pointer -falign-jumps -fselective-scheduling -fno-inline-small-functions -fno-tree-pre -ftracer -fno-move-loop-invariants -O2 -fno-tree-ter -fprefetch-loop-arrays -max-unrolled-insns -fno-inline-functions-called-once -fno-cprop-registers -finline-functions -fno-schedule -fno-align-functions -fno-tree-dce -fno-merge-constants"

The csv.vim plugin ("A Filetype plugin for csv files") has a substitute command that is scoped to a certain column:
:[range]Substitute [column/]pattern/string[/flags]
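If vim isn't a hard requirement, the whole transformation is also easy to script outside the editor. Here is a minimal Python sketch; "flags.csv" is a placeholder file name, the dataset1 label format is inferred from the desired output above, and only the first few flags are spelled out (fill in all 24 from compilerOptionList, in order):

import csv

# The flags from compilerOptionList, in order (first few shown;
# paste in all 24 so positions line up with the CSV columns).
compiler_options = [
    "-funsafe-math-optimizations",
    "-fno-guess-branch-probability",
    "-fno-ivopts",
    "-fno-tree-loop-optimize",
]

with open("flags.csv", newline="") as f:   # placeholder file name
    for row in csv.reader(f):
        app, dataset, bits = row[0], row[1], row[2:]
        # keep a flag wherever the corresponding column holds a "1"
        chosen = [flag for bit, flag in zip(bits, compiler_options) if bit == "1"]
        print(app, "dataset" + dataset, " ".join(chosen))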

Related

line feed within a column in csv

I have a CSV like the one below. Some of the columns have line breaks, like column B. When I run wc -l file.csv, Unix returns 4, but these are actually 3 records. I don't want to replace the line breaks with spaces; I am going to load the data into a database using SQL*Loader and want to load it as is. What should I do so that Unix counts a record containing a line break as one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like perl and python have CSV parsing libraries available, packages like csvkit provide useful utilities, and more.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
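Python's standard csv module treats the quoted line break the same way; a minimal sketch on the same file:

import csv

# csv.reader keeps the quoted "hello\nworld" field inside a single record
with open("foo.csv", newline="") as f:
    rows = list(csv.reader(f))
print(len(rows))  # 3, even though wc -l sees 4 physical lines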

Saving specific Excel sheet as .csv

I am trying to figure out how to save a specific Excel sheet as CSV via the command line on Linux.
I am able to save the first sheet with the command below:
libreoffice --headless --convert-to csv --outdir /tmp /tmp/test.xls
It seems that there should be a way to specify the sheet I want to save, but I am not able to find one.
Is there a way to save it via LibreOffice?
I know OP has probably moved on by now but since this was the first result in my search, I figured I'd take a stab at leaving an answer that works and is actually usable for the next googler.
First, LibreOffice still only lets you save the first sheet. If that is all you need, then try libreoffice --convert-to csv Test.ods. Interestingly, the GUI does the same thing - only letting you export the active sheet. So it's not that the command line is being ignored; it's just a limitation in LibreOffice.
I needed to extract several sheets into separate csv files, so "active sheet only" didn't cut it for me. After seeing that this answer only had a macro as the suggestion, I kept looking. I found a few other ways to get at the other sheets in various places after this page, but I don't recall any of them allowing you to extract a specific sheet (unless it was some random github tool that I skipped over).
I liked the method of using the Gnumeric spreadsheet application because it is in most central repos and doesn't involve converting to xls / xlsx first. However, there are a few caveats to be aware of.
First, if you want to extract only one sheet without knowing the sheet name ahead of time, then this won't work. If you do know the sheet name ahead of time, or are OK with extracting all the sheets, then this works fairly well. The sheet name can be used to create the output files, so it's not completely lost, which is nice too.
Second, if you want the quoting style to match the style you'd get by manually exporting from the LibreOffice GUI, then you will need to forget the term "csv" and think in terms of "txt" until you finish the conversion (e.g. convert to .txt files, then rename them). Otherwise, if you don't care about an exact match on quoting style, this doesn't matter. I will show both ways below. If you don't know what a quoting style is: basically, in CSV, if a value contains spaces or a comma, you put quotes around the cell value to distinguish it from the commas used to separate fields. Some programs quote everything, others quote only values containing spaces and/or commas, and others don't quote at all (or quote only for commas).
Last, there seems to be a difference in precision when converting via LibreOffice versus Gnumeric's ssconvert tool. Not enough to matter for most people in most use-cases, but still worth noting. In my original ods file, I had a formula taking the average of 3 cells with 58.14, 59.1, and 59.05 respectively. This average came to 58.7633333333333 when I exported via the LibreOffice GUI. With ssconvert, the same value was 58.76333333333333 (i.e. it had one additional decimal place compared to the LibreOffice version). I didn't really care for my purposes, but if you need to exactly match LibreOffice or don't want the extra precision, it might matter.
From man ssconvert, we have the following options:
-S, --export-file-per-sheet: Export a file for each sheet if the exporter only supports one sheet at a time. The output filename is treated as a template in which sheet number is substituted for %n, sheet name is substituted for %s, and sheet object name is substituted for %o in case of graph export. If there are no substitutions, a default of ".%n" is added.
-O, --export-options=optionsstring : Specify parameters for the chosen exporter. optionsstring is a list of parameter=value pairs, separated by spaces. The parameter names and values allowed are specific to the exporter and are documented below. Multiple parameters can be specified
During my testing, the -O options were ignored if I specified the output file with a .csv extension. But if I used .txt then they worked fine.
I'm not covering them all and I'm paraphrasing so read the man page if you want more details. But some of the options you can provide in the optionsstring are as follows:
sheet: Name of the sheet. You can repeat this option for multiple sheets. In my testing, using indexes did NOT work.
separator: If you want a true comma-separated-values file, then we'll need to use a comma.
format: I'll be using raw because I want the unformatted values. If you need something special for dates, etc., read the man page.
quoting-mode: When to quote values. Can be always, auto, or never. If you want to mimic LibreOffice as closely as possible, choose never.
So let's get to a terminal.
# install gnumeric on fedora
$ sudo dnf install -y gnumeric
# install gnumeric on ubuntu/mint/debian
$ sudo apt-get install -y gnumeric
# use the ssconvert util from gnumeric to do the conversion
# let it do the default quoting - this will NOT match LibreOffice
# in this example, I am just exporting 1 named sheet using
# -S, --export-file-per-sheet
$ ssconvert -S -O 'sheet=mysheet2' Test.ods test_a_%s.csv
$ ls *.csv
test_a_mysheet2.csv
# same thing but more closely mimicking LibreOffice output
$ ssconvert -S -O 'sheet=mysheet2 separator=, format=raw quoting-mode=never' Test.ods test_b_%s.txt;
$ mv test_b_mysheet2.txt test_b_mysheet2.csv;
# Q: But what if I don't know the sheet names?
# A: then you'll need to export everything
# notice the 'sheet' option is removed from the
# list of -O options vs previous command
$ ssconvert -S -O 'separator=, format=raw quoting-mode=never' Test.ods test_c_%n_%s.txt;
$ ls test_c*
test_c_0_mysheet.txt test_c_3_yoursheet2.txt
test_c_1_mysheet2.txt test_c_4_yoresheet.txt
test_c_2_yoursheet.txt test_c_5_holysheet.txt
# Now to rename all those *.txt files to *.csv
$ prename 's/\.txt/\.csv/g' test_c_*.txt
$ ls test_c*
test_c_0_mysheet.csv test_c_3_yoursheet2.csv
test_c_1_mysheet2.csv test_c_4_yoresheet.csv
test_c_2_yoursheet.csv test_c_5_holysheet.csv
Command:
soffice --headless "macro:///Library1.Module1.ConvertSheet(~/Desktop/Software/OpenOffice/examples/input/Test1.ods, Sheet2)"
Code:
Sub ConvertSheet(SpreadSheetPath As String, SheetNameSeek As String)
    REM IN: SpreadSheetPath is the FULL PATH and file
    REM IN: SheetNameSeek is the sheet name to be found and converted to CSV
    Dim Doc As Object
    Dim Dummy()
    SheetNameSeek = Trim(SheetNameSeek)
    If (Not GlobalScope.BasicLibraries.isLibraryLoaded("Tools")) Then
        GlobalScope.BasicLibraries.LoadLibrary("Tools")
    End If
    REM The content of an opened window can be replaced with the help of the frame parameter and SearchFlags:
    SearchFlags = com.sun.star.frame.FrameSearchFlag.CREATE + _
                  com.sun.star.frame.FrameSearchFlag.ALL
    REM Set up a Propval object to store the filter properties
    Dim Propval(1) As New com.sun.star.beans.PropertyValue
    Propval(0).Name = "FilterName"
    Propval(0).Value = "Text - txt - csv (StarCalc)"
    Propval(1).Name = "FilterOptions"
    REM 44 = field separator (comma), 34 = text delimiter ("), 76 = character set (UTF-8), 1 = first row to export
    Propval(1).Value = "44,34,76,1"
    Url = ConvertToUrl(SpreadSheetPath)
    Doc = StarDesktop.loadComponentFromURL(Url, "MyFrame", SearchFlags, Dummy)
    FileN = FileNameoutofPath(Url)
    BaseFilename = Tools.Strings.GetFileNameWithoutExtension(FileN)
    DirLoc = DirectoryNameoutofPath(ConvertFromUrl(Url), "/") + "/"
    Sheets = Doc.Sheets
    NumSheets = Sheets.Count - 1
    For J = 0 To NumSheets
        SheetName = Sheets(J).Name
        If (SheetName = SheetNameSeek) Then
            REM Make the matching sheet active, then store it as CSV next to the source file
            Doc.getCurrentController.setActiveSheet(Sheets(J))
            Filename = DirLoc + BaseFilename + "." + SheetName + ".csv"
            FileURL = ConvertToURL(Filename)
            Doc.StoreAsURL(FileURL, Propval())
        End If
    Next J
    Doc.close(True)
End Sub
I ended up using xlsx2csv.
Version 0.7.8 supports general xlsx files pretty well. It allows you to specify the tab by number or by name.
It does not do a good job on macros and complicated multi-sheet documents, but it does a very good job on regular multi-sheet xlsx documents.
Unfortunately, xlsx2csv does not support password-protected xlsx files, so for those I still have to use the Win32::OLE Perl module and run it in a Windows environment.
From what I can see, LibreOffice still does not have the ability to select the tab via the command line.
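For what it's worth, xlsx2csv can also be driven from Python rather than from the shell. The sketch below follows the usage pattern from its README; the sheetid parameter (tab number) is what I remember from the docs, so verify it against your installed version:

from xlsx2csv import Xlsx2csv

# convert the second tab of myfile.xlsx to CSV
# (per the docs, sheetid selects the tab by number)
Xlsx2csv("myfile.xlsx", outputencoding="utf-8").convert("myfile.csv", sheetid=2)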

Unix diff with custom line separator

Looking to compare two CSV files. Suppose the field separator is $, each record has two fields, and the file can be formatted something like:
a$simple line$
b$run-on-
line$
c$simple line$
Is there some switch or variety of Unix diff command that will let me run the comparison where the record separator (line separator) is the $ sign immediately followed by a new line?
Ideally I want to be guaranteed that diff outputs the entire record when any change is detected.
With the default behavior, I could potentially get a partial record as diff output (in scenarios where the record runs over several lines).
Is there some smarter way to do this that I'm not considering?
--
Edited to add: sample of expected output
If I compared the CSV file above with:
a$simple line$
b$run-on-changed-
line$
c$simple line$
... I would want to see the entire record b reported as a difference. Something like
2c2
< b$run-on-\nline$
---
> b$run-on-changed-\nline$
Peter, there is no direct support for a custom line separator in GNU diff: http://man7.org/linux/man-pages/man1/diff.1.html (GNU diffutils)
You can use sed twice: one sed pass to convert your format to one record per line for diffing, then diff the converted files, then a second sed pass to convert back to the multi-line record format.
The first sed pass keeps each newline preceded by $ as a real record-ending newline, and replaces every newline not preceded by $ (an embedded line break) with some unique special sequence, like #%#$%#$%#$#.
Then do the diff.
The second sed pass converts #%#$%#$%#$# back to \n (or to a literal \n for easier viewing of the diff output).
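The same round trip can be sketched in Python instead of sed. This is only an illustration of the idea, not a drop-in tool: it treats a line ending in $ as the end of a record, shows embedded line breaks as a literal \n, and prints a unified diff rather than the 2c2 format shown above:

import difflib

def records(path):
    # glue physical lines together until one ends with the $ terminator,
    # so each logical record becomes a single string
    recs, buf = [], ""
    with open(path) as f:
        for line in f:
            buf += line.rstrip("\n")
            if buf.endswith("$"):
                recs.append(buf)
                buf = ""
            else:
                buf += "\\n"  # keep the embedded line break visible
    return recs

# a whole record is reported even when only part of it changed
for d in difflib.unified_diff(records("old.csv"), records("new.csv"), lineterm=""):
    print(d)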
There are diff variants that understand CSV. Some of them may handle CSV with line breaks inside fields:
https://pypi.python.org/pypi/csvdiff (python)
csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. This is useful if you’re comparing the output of an automatic system from one day to the next, so that you can look at just what’s changed.
https://github.com/agardiner/csv-diff (ruby)
Unlike a standard diff that compares line by line, and is sensitive to the ordering of records, CSV-Diff identifies common lines by key field(s), and then compares the contents of the fields in each line.
http://csvdiff.sourceforge.net/ (perl)
csvdiff is a perl script to compare/diff two (comma) separated files with each other. The part that differs from standard diff is that you get the number of the record where the difference occurs and the field/column which is different. The separator can be set to any value you want, not just a comma. You can also provide a third file which contains the column names in one(!) line separated by your separator.

Compare CSV files

I am currently using a Windows utility called TableTextCompare.
This tool can take 2 CSV files and compare them. The nice thing about it is that it can make the comparison even if the records of the 2 files are not sorted in the same order or the fields are not positioned in the same order.
As such, the following 2 files would result in a successful comparison
(File1.csv)
FirstName,LastName,Age
Mona,Sax,30
Max,Payne,43
Jack,Lupino,50
(File2.csv)
FirstName,Age,LastName
Max,43,Payne
Jack,50,Lupino
Mona,30,Sax
What I am looking for is to do the same thing from the command-line with just 1 difference:
I would like the comparison to be performed in one direction only, i.e. if File2.csv is as follows (a subset of File1.csv), the comparison should pass
(File2.csv)
FirstName,Age,LastName
Jack,50,Lupino
I do not particularly care if it's going to be in some programming language, a dedicated cli tool or a shell script (e.g. using awk). I have some experience with Java and Groovy but would like to be pointed to some initial direction.
I can offer a Python solution:
import csv

with open("file1.csv") as f1, open("file2.csv") as f2:
    r1 = list(csv.DictReader(f1))
    r2 = csv.DictReader(f2)
    for item in r2:
        if item not in r1:
            print("r2 is not a subset of r1!")
            break
This is actually a bit more verbose than necessary in Python (but easier to understand); I personally would have used a generator expression:
import csv

with open("file1.csv") as f1, open("file2.csv") as f2:
    r1 = list(csv.DictReader(f1))
    r2 = csv.DictReader(f2)
    if all(item in r1 for item in r2):
        print("r2 is a subset of r1")
If you can afford to do a case insensitive comparison, and if there are no duplicates within File2.csv that must be matched within File1.csv, and if File1.csv does not contain \\ or \", then all you need is a simple FINDSTR command.
The following will list lines in File2.csv that do not appear in File1.csv:
findstr /vxig:"File1.csv" "File2.csv"
If all you want is an indication whether File1.csv is a superset of File2.csv, then
findstr /vxig:"File1.csv" "File2.csv" >nul && (echo File1 is NOT a superset of File2) || (echo File1 IS a superset of File2)
The search should not have to be case insensitive, except there is a nasty FINDSTR bug: it may fail to find matches when there are multiple case sensitive literal search strings of varying size. The case insensitive option avoids the bug. See Why doesn't this FINDSTR example with multiple literal search strings find a match? for more info.
The search will not work properly if File2.csv contains \\ or \" because FINDSTR will treat them as \ and " respectively. See What are the undocumented features and limitations of the Windows FINDSTR command? for more info. The accepted answer has sections describing FINDSTR escape sequences about half way down.
You can take a look at q - Text as a Database, which allows executing SQL directly on CSV files, including joins. This makes the comparison easy, and much more, such as matching specific columns for equality or getting specific columns from rows that don't match.
Full disclosure - It's my own open source tool.
Harel Ben-Attia

Natural ordering files in directory into a cell array using Octave

I have files being generated by another program/user that have names such as "jh-1.txt, jh-2.txt, ..., jh-100.txt, ..., jh-1024.txt". I'm extracting a column from these files, manipulating the data, and outputting to a new matrix. The only problem is that Octave is using ASCII ordering and not natural ordering when reading in the files. Thus, the output matrix is not ordered in a natural way. My question is, can Octave sort file names in a natural order? I'm getting file names in the standard method:
fileDirectory = '/path/to/directory';
filePattern = fullfile(fileDirectory, '*.txt'); % Selects only the txt files.
dataFiles = dir(filePattern); % Gets the info from the txt files in the directory.
baseFileName = {dataFiles.name}'; % Gets all the txt file names.
I can't rename the files because this is a script for another user. They are on a Windows machine and already have Octave installed with Cygwin, and I don't want to make them use the command line more than they have to because they are unfamiliar with it. Alternatively, it would be nice to have the output with the file names in a column, but I haven't figured that one out either (bit of a noob with Octave myself). That way the user could use Excel (which they are familiar with) to sort the columns.
I don't think there's a built-in natural sort in Octave. However, there is a natural sort submission on the MathWorks File Exchange. I've not used it, but the comments imply it works in Octave too.
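For reference, the usual natural-sort trick is to split each name into digit and non-digit runs and compare the digit runs numerically. Here is a minimal sketch of the idea in Python (the same regexp-split approach can be ported to Octave with regexp and a custom sort):

import re

def natural_key(name):
    # "jh-100.txt" -> ["jh-", 100, ".txt"], so numbers compare numerically
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]

names = ["jh-1024.txt", "jh-2.txt", "jh-100.txt", "jh-1.txt"]
print(sorted(names, key=natural_key))
# ['jh-1.txt', 'jh-2.txt', 'jh-100.txt', 'jh-1024.txt']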