How to combine thousands of .csv files into one master file?

I have a folder of a little over 10,000 .csv files that I want to combine into one master file. They are all organized the same way (columns A, B, C, D, E and F mean the same thing in each file). I'd prefer to do it in a shell script.
I tried
cat *.csv > Everything.csv
and it fails with "Argument list too long".
I also tried
copy *.csv > Everything.csv
and it returns the same error.
How do I get it to combine about 10,000 files into one Master file?

This question discusses the error you're seeing: Argument list too long error for rm, cp, mv commands
One possible solution would be something like:
find . -name "*.csv" -exec cat '{}' >> ./Everything.csv ';'
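One possible refinement, as a sketch (assuming the csv files sit directly in this folder and that your find supports -maxdepth): batching the cat calls with + instead of ';' and excluding the output file so it isn't concatenated into itself:
find . -maxdepth 1 -name '*.csv' ! -name 'Everything.csv' -exec cat {} + > Everything.csv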

I've done it using:
cat *.csv >> /tmp/master_file_name.csv
A single '>' will overwrite the file rather than append to it.
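As a quick illustration of the difference (demo.csv is just a throwaway name for this example):
printf 'first\n'  > demo.csv    # '>' truncates (or creates) demo.csv, then writes
printf 'second\n' >> demo.csv   # '>>' appends; demo.csv now contains both lines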

You can use a simple for loop:
for file in ./*.csv; do
    cat "$file" >> master_file
done

find . -name "*.csv" -print0 | xargs -0 cat > Everything.csv
(edit) Ah well, beat to the punch...

How to format a TXT file into a structured CSV file in bash?

I wanted to get some information about the CPU temperatures on my Linux server (OpenSuse Leap 15.2), so I wrote a script that collects data every 20 seconds and writes it to a text file. I have since removed all the data I don't need (labels like "CPU Temp" etc.).
Now I have a file like this:
47
1400
75
3800
The first two lines are one reading of the CPU temperature in °C and the fan speed in RPM, respectively. The next two lines are another reading of the same measurements.
In the end I want this structure:
47,1400
75,3800
My question is: can a Bash script do this for me? I tried something with sed and awk, but nothing worked quite right. Eventually I want a CSV file so I can make a graph, but I don't think converting a text file into a CSV file will be a problem.
You could use paste
paste -d, - - < file.txt
With pr
pr -ta2s, file.txt
With ed
ed -s file.txt <<-'EOF'
g/./s/$/,/\
;.+1j
,p
Q
EOF
You can use awk:
awk 'NR%2{printf "%s,",$0;next;}1' file.txt > file.csv
Another awk:
$ awk -v OFS=, '{printf "%s%s",$0,(NR%2?OFS:ORS)}' file
Output:
47,1400
75,3800
Explained:
$ awk -v OFS=, '{ # set output field delimiter to a comma
printf "%s%s", # using printf to control newline in output
$0, # output line
(NR%2?OFS:ORS) # and either a comma or a newline
}' file
Since you asked if a bash script can do this, here's a solution in pure bash. ;o]
c=0
while read -r line; do
    if (( c++ % 2 )); then
        echo "$line"
    else
        printf "%s," "$line"
    fi
done < file
Take a look at 'paste'. This will join multiple lines of text together into a single line and should work for what you want.
echo "${DATA}"
Name
SANISGA01CI
5WWR031
P59CSADB01
CPDEV02
echo "${DATA}"|paste -sd ',' -
Name,SANISGA01CI,5WWR031,P59CSADB01,CPDEV02

Search in large csv files

The problem
I have thousands of csv files in a folder. Each file has 128,000 rows with four columns per line.
Twice a day I need to compare a list (10,000 entries) against all the csv files. If an entry is identical to the third or fourth column of a row in any csv file, I need to write that whole csv row to a separate file.
Possible solutions
Grep
#!/bin/bash

getArray() {
    array=()
    while IFS= read -r line; do
        array+=("$line")
    done < "$1"
}

getArray "entries.log"

for e in "${array[@]}"; do
    echo "$e"
    /bin/grep "$e" ./csv/* >> found
done
This seems to work, but it takes forever: after almost 48 hours the script had checked only 48 of the roughly 10,000 entries.
MySQL
My next attempt was to import all the csv files into a MySQL database, but I ran into problems once the table reached around 50,000,000 rows.
So I wrote a script that starts a new table after 49,000,000 rows, which let me import all the csv files.
I tried to create an index on the second column, but it always failed (timeout). Creating the index before the import wasn't an option either: it slowed the import down from a few hours to days.
The select statement was horrible, but it worked. It was much faster than the grep solution, but still too slow.
My question
What else can I try to search within the csv files?
To speed things up I already copied all the csv files to an SSD, but I hope there are other ways.
Some improvements to your script; the first is mostly cosmetic, but the second should make a real difference:
use the built-in mapfile to slurp a file into an array:
mapfile -t array < entries.log
use grep with a file of patterns and appropriate flags.
I assume you want to match items in entries.log as fixed strings, not as regex patterns.
I also assume you want to match whole words.
grep -Fwf entries.log ./csv/*
This means you don't have to grep the thousands of csv files thousands of times (once for each item in entries.log). This alone should give you a real, meaningful performance improvement.
This also removes the need to read entries.log into an array at all.
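Putting those two points together, the whole loop collapses to a single command; -h (added here) suppresses the file-name prefix grep would otherwise print for multiple files, so only the matching csv rows land in found, which is what the question asks for:
grep -hFwf entries.log ./csv/* > found
If the csv directory itself holds too many files for the shell glob, the same find-based workarounds from the first question apply.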
Here is an awk solution, assuming all the csv files change between runs; otherwise it would be wise to keep track of the files already checked. But first some test material:
$ mkdir test # the csvs go here
$ cat > test/file1 # has a match in 3rd
not not this not
$ cat > test/file2 # no match
not not not not
$ cat > test/file3 # has a match in 4th
not not not that
$ cat > list # these we look for
this
that
Then the script:
$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that
Explained:
$ awk ' # awk
NR==FNR { # process the list file
a[$1] # hash list entries to a
next # next list item
}
($3 in a) || ($4 in a) { # if 3rd or 4th field entry in hash
print >> "out" # append whole record to file "out"
}' list test/* # first list then the rest of the files
The script hashes all the list entries into a, then reads through the csv files looking for 3rd or 4th field values that are in the hash, appending the whole record to out whenever there is a match.
If you test it, let me know how long it ran.
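One caveat: the test files above are space separated, which matches awk's default field splitting. For real comma-separated csv files you would presumably add -F, (assuming the fields and list entries contain no embedded commas), e.g.:
awk -F, 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list ./csv/*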
You can build a patterns file and then use xargs and grep -Ef to search for all patterns in batches of csv files, rather than one pattern at a time as in your current solution:
# prepare patterns file
while read -r line; do
    printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$"   # find value in third column
    printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$"   # find value in fourth column
done < entries.log > patterns.dat
find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat
find ... - emits a NUL-delimited list of all csv files found
xargs -0 ... - passes the file list to grep, in batches
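For a hypothetical entry abc123 in entries.log, the loop emits this pair of lines into patterns.dat:
^[^,]+,[^,]+,abc123,[^,]+$
^[^,]+,[^,]+,[^,]+,abc123$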

Finding all kinds of extensions referenced in an HTML file

Here is my problem statement:
There is a folder with many HTML and text files. I need to go through each of them recursively and find all the file extensions referenced in these HTML/text files, such as .jpg, .tif, .png, etc.
The problem is that I don't have a predefined list of extensions to search for.
What would be the best way to achieve this using a shell script?
Perhaps coming up with a regex that searches for every occurrence of a dot followed by 3 or 4 letters, filtering out the ones that end with a space, a comma, a quote, etc.?
Any suggestions would be helpful.
You could write a shell script to parse the file names with a regex, but a straightforward version is pretty simple:
$ cat *.{txt,html} | grep -oP '\b[A-Za-z0-9_]+\.[A-Za-z0-9]{1,4}\b' | awk -F. '{ print "." $(NF) }' | sort -u
For recursive search:
find . -name '*.txt' -or -name '*.html' -exec grep -oP '\b[A-Za-z0-9_.]+\.[A-Za-z0-9]{1,4}\b' {} \; | awk -F. '{ print "." $(NF) }' | sort -u
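A possible tweak, keeping the same regex idea (so it shares the same false positives, e.g. version numbers): -exec ... + runs grep on batches of files instead of once per file, and -h drops the file-name prefixes grep adds when it is handed several files at once:
find . \( -name '*.txt' -o -name '*.html' \) -exec grep -ohP '\b[A-Za-z0-9_.]+\.[A-Za-z0-9]{1,4}\b' {} + | awk -F. '{ print "." $NF }' | sort -u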

Linux command line to zip files based on a MySQL result set

I have a table where some filenames are stored.
I would like to find all the files with those names under a specific folder and zip them all.
On disk the structure is similar to this:
/folder/sub1/file1
/folder/sub1/file2
/folder/sub2/file1 <- same name as under sub1
/folder/sub2/file2
So I am looking for something similar to:
mysql -e "select file from table" | find /folder -type f -name <the value of file from mysql result set> | zip <all files found by all find commands>
Thanks.
A couple of additions to your command:
Firstly, you want to use mysql in batch mode, so you do this:
mysql -Be "select file from table"
This gives you a single-column table with no borders, so you get rid of the header by piping it to tail, starting at the second line:
tail -n +2
Then you pipe that to xargs, but before you do, hack it a bit with concat (you'll see why in a sec):
mysql -Be "select concat(' -o -name ', file) from table"
NOW you pipe it to xargs:
xargs find /folder -false
This starts with a test that is always false (i.e. a no-op), but xargs then appends a whole pile of terms like -o -name somename.file, each of which performs a boolean OR (with -false originally, then with all the other file names), so find ultimately prints the list of files whose names match.
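For example, if the table contained the two rows file1 and file2, the command that xargs ends up running would effectively be:
find /folder -false -o -name file1 -o -name file2
which prints every file under /folder matching either name.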
...which you finally pipe to zip, with another xargs:
xargs zip files.zip
Again, this puts the file names as arguments to zip.
Here's the total line:
mysql -Be "select concat(' -o -name ', file) from table" | tail -n +2 | xargs find /folder -false | xargs zip files.zip
Bear in mind that this assumes there are no spaces in your filenames. If there are, it gets a bit more complex: you can work around it by using -print0 and -0 in find and xargs respectively, although zip has a harder time with that, so you would need another intermediate stage (or use zip -r).
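Concretely, that -print0/-0 workaround could look something like this sketch (table and column names taken from the question; it costs one find invocation per row, so it only makes sense for a modest number of rows):
mysql -NBe "select file from table" | while IFS= read -r name; do
    find /folder -type f -name "$name" -print0   # NUL-delimited so spaces survive
done | xargs -0 zip files.zip
Here -N skips the column header, which replaces the tail -n +2 step.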

Mass editing CSV files through command line?

I have a huge list of CSV files in a certain directory.
I need to change the A1 field (the first cell of the header row) in all the CSV files to this: Email
Is there any way to do this to all the files in one command?
Or, if this is easier: I just need Email to be the first line of each file, so if there's a way to mass-insert Email as the first line of each file, that'll work perfectly too!
Here's a quick and dirty example for you:
replace="Email"
path="./"
ext="csv"
for f in "$path"*."$ext"; do
    # take the first field of the first line as the text to replace
    search=$(head -1 "$f" | awk '{print $1}')
    echo "Changing: $f"
    sed -i -e "1s/$search/$replace/" "$f" && echo Done
done
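If the simpler goal from the question is enough (just make Email the first line of every file), a minimal sketch using GNU sed's -i and the 1i command would be:
find . -maxdepth 1 -name '*.csv' -exec sed -i '1i Email' {} +
Note this inserts a new first line rather than replacing the existing header, which matches the "if this is easier" variant of the question.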