I have a really huge CSV file. There are about 1700 columns and 40000 rows, like below:
x,y,z,x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1700 more)...,x1700
0,0,0,a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1700 more)...,a1700
1,1,1,b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1700 more)...,b1700
// (about 40000 more rows below)
I need to split this CSV file into multiple files which contain fewer columns, like:
# file1.csv
x,y,z
0,0,0
1,1,1
... (about 40000 more rows below)
# file2.csv
x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1000 more)...,x1000
a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1000 more)...,a1000
b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1000 more)...,b1000
// (about 40000 more rows below)
#file3.csv
x1001,x1002,x1003,x1004,x1005,...(about 700 more)...,x1700
a1001,a1002,a1003,a1004,a1005,...(about 700 more)...,a1700
b1001,b1002,b1003,b1004,b1005,...(about 700 more)...,b1700
// (about 40000 more rows below)
Is there any program or library that does this?
I've googled for it, but the programs I found only split a file by rows, not by columns.
Or which language could I use to do this efficiently?
I can use R, shell script, Python, C/C++, or Java.
A one-liner (per output file) for your example data and desired output:
cut -d, -f -3 huge.csv > file1.csv
cut -d, -f 4-1003 huge.csv > file2.csv
cut -d, -f 1004- huge.csv > file3.csv
The cut program is available on most POSIX platforms and is part of the GNU Core Utilities; there is also a Windows version. Note that cut splits on every comma, so this assumes no field contains a quoted comma.
Update in Python, since the OP asked for a program in one of the acceptable languages:
# python 3 (in python 2, open the output files with 'wb' and drop newline='')
import csv
import fileinput

output_specifications = (  # csv file name, selector function
    ('file1.csv', slice(3)),
    ('file2.csv', slice(3, 1003)),
    ('file3.csv', slice(1003, 1703)),
)

output_row_writers = [
    (
        csv.writer(open(file_name, 'w', newline=''), quoting=csv.QUOTE_MINIMAL).writerow,
        selector,
    ) for file_name, selector in output_specifications
]

reader = csv.reader(fileinput.input())
for row in reader:
    for row_writer, selector in output_row_writers:
        row_writer(row[selector])
This works with the sample data given and can be called with the input.csv as an argument or by piping from stdin.
Use a small Python script like:

fin = 'file_in.csv'
fout1 = 'file_out1.csv'
fout1_fd = open(fout1, 'w')
# ... open the other output files the same way

with open(fin) as fin_fd:
    lines = fin_fd.read().split('\n')

for l in lines:
    if not l:
        continue  # skip the trailing empty string left by split
    l_arr = l.split(',')
    fout1_fd.write(','.join(l_arr[0:3]))
    fout1_fd.write('\n')
    # ... write the other column ranges to their files

fout1_fd.close()
# ... close the other output files
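A slightly fuller sketch along the same lines, streaming the input line by line instead of reading the whole file into memory; the file names and column ranges are taken from the question, and like the snippet above it assumes no field contains a quoted comma:

# column ranges from the question: fields 0-2 -> file 1, 3-1002 -> file 2, 1003-1702 -> file 3
ranges = [
    ('file_out1.csv', 0, 3),
    ('file_out2.csv', 3, 1003),
    ('file_out3.csv', 1003, 1703),
]

outs = [(open(name, 'w'), start, stop) for name, start, stop in ranges]
with open('file_in.csv') as fin:
    for line in fin:
        fields = line.rstrip('\n').split(',')
        for fd, start, stop in outs:
            fd.write(','.join(fields[start:stop]) + '\n')
for fd, _, _ in outs:
    fd.close()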
You can open the file in Microsoft Excel, delete the extra columns, and save it as CSV for file #1. Repeat the same procedure for the other two files.
I usually use OpenOffice (or Microsoft Excel if you are on Windows) to do that without writing any program: just edit the file and save it. The following are two useful links showing how to do that.
https://superuser.com/questions/407082/easiest-way-to-open-csv-with-commas-in-excel
http://office.microsoft.com/en-us/excel-help/import-or-export-text-txt-or-csv-files-HP010099725.aspx
Related
I have multiple CSV files that I need to split into 67 separate files each. Each sheet has over a million rows and dozens of columns. One of the columns is called "Code" and it ranges from 1 to 67 which is what I have to base the split on. I have been doing this split manually by selecting all of the rows within each value (1, 2, 3, etc) and pasting them into their own CSV file and saving them, but this is taking way too long. I usually use ArcGIS to create some kind of batch file split, but I am not having much luck in doing so this go around. Any tips or tricks would be greatly appreciated!
If you have access to awk there's a good way to do this.
Assuming your file looks like this:
Code,a,b,c
1,x,x,x
2,x,x,x
3,x,x,x
You want a command like this:
awk -F, 'NR > 1 {print $0 >> ("code" $1 ".csv")}' data.csv
That will save it to files like code1.csv etc., skipping the header line.
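If you would rather do this in Python (for example alongside ArcGIS, whose scripting is Python-based), here is a rough pandas sketch of the same split. The file name data.csv is an assumption; the column name Code comes from the question:

import pandas as pd

df = pd.read_csv("data.csv")  # "data.csv" is a placeholder for your input file
for code, group in df.groupby("Code"):
    # one output file per Code value, e.g. code1.csv ... code67.csv
    group.to_csv(f"code{code}.csv", index=False)

For files with over a million rows, reading with pd.read_csv(..., chunksize=...) and appending to each per-code file (mode='a', writing the header only once) keeps memory use down.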
I am currently working with datasets collected in large CSV files (over 1600 columns and 100 rows). Excel or LibreOffice Calc can't easily handle these files for concatenating a prefix or suffix to the header row, which is what I would have done on a smaller dataset.
Researching the topic I was able to come up with the following command:
awk 'BEGIN { FS=OFS="," } {if(NR==1){print "prefix_"$0}; if(NR>1){print; next}}' input.csv >output.csv
Unfortunately, this only adds the prefix to the first cell. For example:
Input:
head_1,head_2,head_3,[...],head_n
"value_1","value_2","value_3",[...],"value_n"
Expected Output:
prefix_head_1,prefix_head_2,prefix_head_3,[...],prefix_head_n
"value_1","value_2","value_3",[...],"value_n"
Real Output:
prefix_head_1,head_2,head_3,[...],head_n
"value_1","value_2","value_3",[...],"value_n"
As the column number may be variable across different csv files, I would like a solution that doesn't require enumeration of all columns as found elsewhere.
This is necessary because the next step is to combine various (5 or 6) large csv files into a single csv database by combining all columns (the rows refer to the same instances, in the same order, across all files).
Thanks in advance for your time and help.
awk 'BEGIN{FS=OFS=","} NR==1{for (i=1;i<=NF;i++) $i="prefix_"$i} 1' file
On the header line (NR==1) this loops over every field and prepends prefix_; the bare 1 at the end prints every line, so the data rows pass through untouched.
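If awk is not available, a minimal Python sketch of the same idea; the file names input.csv and output.csv come from the command in the question:

import csv

with open("input.csv", newline="") as fin, open("output.csv", "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)  # note: the csv module re-quotes minimally, so quoting may differ from the input
    header = next(reader)
    writer.writerow(["prefix_" + name for name in header])  # prefix every header cell
    writer.writerows(reader)                                 # copy the data rows through unchanged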
The problem
I have thousands of csv files in a folder. Every file has 128,000 entries with four columns in each line.
From time to time (twice a day) I need to compare a list (10,000 entries) with all the csv files. If one of the entries is identical to the third or fourth column of one of the csv files, I need to write the whole csv row to an extra file.
Possible solutions
Grep
#!/bin/bash
getArray() {
    array=()
    while IFS= read -r line
    do
        array+=("$line")
    done < "$1"
}

getArray "entries.log"

for e in "${array[@]}"
do
    echo "$e"
    /bin/grep $e ./csv/* >> found
done
This seems to work, but it takes forever. After almost 48 hours the script had checked only 48 entries out of about 10,000.
MySQL
The next try was to import all csv files into a MySQL database. But there I ran into problems with my table at around 50,000,000 entries.
So I wrote a script which creates a new table after 49,000,000 entries, and with that I was able to import all the csv files.
I tried to create an index on the second column, but it always failed (timeout). Creating the index before the import wasn't possible either; it slowed the import down to days instead of only a few hours.
The select statement was horrible, but it worked. Much faster than the "grep" solution, but still too slow.
My question
What else can I try to search within the csv files?
To speed things up I copied all csv files to an SSD. But I hope there are other ways.
This is unlikely to offer meaningful benefits on its own, but here are some improvements to your script:
use the built-in mapfile to slurp a file into an array:
mapfile -t array < entries.log
use grep with a file of patterns and appropriate flags.
I assume you want to match items in entries.log as fixed strings, not as regex patterns.
I also assume you want to match whole words.
grep -Fwf entries.log ./csv/*
This means you don't have to grep the thousands of csv files thousands of times (once for each item in entries.log); this alone should give you a real, meaningful performance improvement. It also removes the need to read entries.log into an array at all. Note that grep matches the strings anywhere on a line, not only in the third or fourth column; if that matters, see the awk approach below, which restricts the match to those fields.
Here is an awk approach, assuming all the csv files change between runs; otherwise it would be wise to keep track of the already-checked files. But first some test material:
$ mkdir test # the csvs go here
$ cat > test/file1 # has a match in 3rd
not not this not
$ cat > test/file2 # no match
not not not not
$ cat > test/file3 # has a match in 4th
not not not that
$ cat > list # these we look for
this
that
Then the script:
$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that
Explained:
$ awk '                          # awk
NR==FNR {                        # process the list file
    a[$1]                        # hash list entries to a
    next                         # next list item
}
($3 in a) || ($4 in a) {         # if 3rd or 4th field entry is in the hash
    print >> "out"               # append the whole record to file "out"
}' list test/*                   # first the list, then the rest of the files
The script hashes all the list entries into a and reads through the csv files looking for 3rd and 4th field entries in the hash, outputting the whole record when there is a match.
If you test it, let me know how long it ran.
You can build a patterns file and then use xargs and grep -Ef to search for all patterns in batches of csv files, rather than one pattern at a time as in your current solution:
# prepare patterns file
while read -r line; do
    printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$" # find value in third column
    printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$" # find value in fourth column
done < entries.log > patterns.dat
find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat
find ... - emits a NUL-delimited list of all csv files found
xargs -0 ... - passes the file list to grep, in batches
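If a scripting language is acceptable, the same hash-lookup idea as the awk answer, as a short Python sketch; entries.log, the ./csv directory, and the found output file are taken from the question:

import csv
from pathlib import Path

# load the ~10,000 lookup values into a set for O(1) membership tests
with open("entries.log") as f:
    wanted = {line.strip() for line in f if line.strip()}

with open("found", "w", newline="") as out:
    writer = csv.writer(out)
    for path in Path("csv").glob("*"):  # mirrors ./csv/* from the question
        if not path.is_file():
            continue
        with open(path, newline="") as f:
            for row in csv.reader(f):
                # keep the row if its 3rd or 4th column matches a lookup value
                if len(row) >= 4 and (row[2] in wanted or row[3] in wanted):
                    writer.writerow(row)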
I have a JSON file exported from MongoDB which looks like:
{"_id":"99919","city":"THORNE BAY"}
{"_id":"99921","city":"CRAIG"}
{"_id":"99922","city":"HYDABURG"}
{"_id":"99923","city":"HYDER"}
There are about 30,000 lines, and I want to split each line into its own .json file. (I'm trying to transfer my data onto a Couchbase cluster.)
I tried doing this:
cat cities.json | jq -c -M '.' | \
while read line; do echo $line > .chunks/cities_$(date +%s%N).json; done
but I found that it seems to drop loads of lines, and the output of running this command only gave me 50-odd files when I was expecting 30,000-odd!
Is there a logical way to make this not drop any data, using any tool that would suit?
Assuming you don't care about the exact filenames, if you want to split input into multiple files, just use split.
jq -c . < cities.json | split -l 1 --additional-suffix=.json - .chunks/cities_
In general, to split any text file into separate per-line files using any awk on any UNIX system:
awk '{close(f); f=".chunks/cities_"NR".json"; print > f}' cities.json
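And if you prefer Python, a minimal sketch that writes one numbered file per line; the cities.json input and the .chunks/cities_ prefix come from the question:

import os

os.makedirs(".chunks", exist_ok=True)
with open("cities.json") as f:
    for i, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        with open(f".chunks/cities_{i}.json", "w") as out:
            out.write(line + "\n")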
This is my first question on here. I work as a meteorologist and have some coding experience, though it is far from professionally taught. Basically what I have is a .csv file from a weather station that is giving me data that is too detailed (65.66 degrees and similar values). What I want to do is automate, via a script file, a way to access the .csv file and get rid of the values that are too detailed: take a temperature from 65.66 to 66 (rounding up for anything above .5 and down for anything below), or take a pressure like 29.8889 and make it 29.89, using the same rounding rules. Is this possible? If so, how should I go about it? Again, keep in mind that my coding skills for batch files are not the strongest.
Any help would be much appreciated.
Thanks,
I agree with the comments above. Math in batch is limited to integers, and won't work well for the manipulations you want.
I'd use PowerShell. Besides easily handling floating point math, it also has built-in methods for objectifying CSV data (as well as XML and other types of structured data). Take the following hypothetical CSV data contained within weather.csv:
date,time,temp,pressure,wx
20160525,12:30,65.66,30.1288,GHCND:US1TNWS0001
20160525,13:00,67.42,30.3942,GHCND:US1TNWS0001
20160525,13:30,68.92,31.0187,GHCND:US1TNWS0001
20160525,14:00,70.23,30.4523,GHCND:US1TNWS0001
20160525,14:30,70.85,29.8889,GHCND:US1TNWS0001
20160525,15:00,69.87,28.7384,GHCND:US1TNWS0001
The first thing you want to do is import that data as an object (using import-csv), then round the numbers as desired -- temp rounded to a whole number, and pressure rounded to a precision of 2 decimal places. Rounding to a whole number is easy. Just recast the data as an integer. It'll be rounded automatically. Rounding the pressure column is pretty easy as well if you invoke the .NET [math]::round() method.
# grab CSV data as a hierarchical object
$csv = import-csv weather.csv
# for each row of the CSV data...
$csv | foreach-object {
    # recast the "temp" property as an integer
    $_.temp = [int]$_.temp
    # round the "pressure" property to a precision of 2 decimal places
    $_.pressure = [math]::round($_.pressure, 2)
}
Now pretend you want to display the temperature, barometric pressure, and weather station name where "date" = 20160525 and "time" = 14:30.
$row = $csv | where-object { ($_.date -eq 20160525) -and ($_.time -eq "14:30") }
$row | select-object pressure,temp,wx | format-table
Assuming "pressure" started with a value of 29.8889 and "temp" had a value of 70.85, then the output would be:
pressure temp wx
-------- ---- --
29.89 71 GHCND:US1TNWS0001
If the CSV data had had multiple rows with the same date and time values (perhaps measurements from different weather stations), then the table would display with multiple rows.
And if you wanted to export that to a new csv file, just replace the format-table cmdlet with export-csv destination.csv
$row | select-object pressure,temp,wx | export-csv outfile.csv
Handy as a pocket on a shirt, right?
Now, pretend you want to display the human-readable station names rather than NOAA's designations. Make a hash table.
$stations = @{
    "GHCND:US1TNWS0001" = "GRAY 1.5 E TN US"
    "GHCND:US1TNWS0003" = "GRAY 1.9 SSE TN US"
    "GHCND:US1TNWS0016" = "GRAY 1.3 S TN US"
    "GHCND:US1TNWS0018" = "JOHNSON CITY 5.9 NW TN US"
}
Now you can add a "station" property to your "row" object.
$row = $row | select *,"station"
$row.station = $stations[$row.wx]
And now if you do this:
$row | select-object pressure,temp,station | format-table
Your console shows this:
pressure temp station
-------- ---- -------
29.89 71 GRAY 1.5 E TN US
For extra credit, say you want to export this row data to JSON (for a web page or something). That's slightly more complicated, but not impossibly so.
add-type -AssemblyName System.Web.Extensions
$JSON = new-object Web.Script.Serialization.JavaScriptSerializer
# convert $row from a PSCustomObject to a more generic hash table
$obj = @{}
# the % sign in the next line is shorthand for "foreach-object"
$row.psobject.properties | %{
    $obj[$_.Name] = $_.Value
}
# Now, stringify the row and display the result
$JSON.Serialize($obj)
The output of that should be similar to this:
{"station":"GRAY 1.5 E TN US","wx":"GHCND:US1TNWS0001","temp":71,"date":"201605
25","pressure":29.89,"time":"14:30"}
... and you can redirect it to a .json file by using > or pipe it into the out-file cmdlet.
DOS batch scripting is, by far, not the best place to edit text files. However, it is possible. I will include sample, incomplete DOS batch code at the bottom of this post to demonstrate the point. I recommend you focus on Excel (no coding needed) or Python.
Excel - You don't need to code at all with Excel. Open the csv file. Let's say you have 66.667 in cell B12. In cell C12 enter a formula using the ROUND function (code below). You can also teach yourself some Visual Basic for Applications, but for this simple task that is overkill. When you are done, if you save in csv format you will lose your formulae and only have data, so consider saving as xlsx or xlsm.
Visual Basic Script - you can run VBScript on your machine with cscript.exe (or wscript.exe), which is part of Windows. But if you are using VBScript, you might as well use VBA in Excel; it is almost identical.
Python is a very high-level language with built-in libraries that make editing a csv file super easy. I recommend Anaconda (a Python suite) from continuum.io, but you can find the generic Python at python.org as well. Anaconda comes prepackaged with lots of helpful libraries. For csv editing, you will likely want to use the pandas library. You can find plenty of short videos on YouTube.
Excel
Say you have 66.667 in cell B12. Set the formula in C12 to...
"=ROUND(B12,0)" to round to integer
"=ROUND(B12,1)" to round to one decimal place
As you copy and paste, Excel will attempt to intelligently update the formulas for you.
Python
import pandas as pd
import numpy as np

# load the csv file into memory; name your columns using names=[]
df = pd.read_csv("C:/temp/weather.csv", names=["city", "temperature", "date"])
df["temperature"] = df["temperature"].apply(np.round)  # round the temperature column
df.to_csv("newfile.csv", index=False)    # export to a new csv file
df.to_excel("newfile.xlsx")              # or export to an excel file instead
DOS Batch
A Batch script for this is much, much harder. I will not write the whole program, because it is not a great solution. But, I'll give you a taste in DOS batch code at the bottom of this post. Compared to using Python or Excel, it is extremely complex.
Here is a rough sketch of DOS code. Because I don't recommend this method, I didn't take the time to debug this code.
setlocal ENABLEDELAYEDEXPANSION
:: prep our new file for output. Let's write the header row.
echo col1, col2, col3 >newfile.csv
:: read the existing text file line by line
:: since it is csv, we will parse on comma
:: skip lines starting with a semi-colon
FOR /F "eol=; tokens=1,2,3 delims=, " %%I in (input_file.txt) do (
    set "col1=%%I"
    set "col2=%%J"
    set "col3=%%K"
    REM truncate col2 to 1 decimal place
    for /f "tokens=1,2 delims=." %%A in ("!col2!") do (
        set "integer=%%A"
        set "decimal=%%B"
        set "decimal=!decimal:~0,1!"
        REM or, you can use an if statement to round up or down.
        REM Now, put the integer and decimal together again and
        REM redefine the value for col2.
        set "col2=!integer!.!decimal!"
    )
    REM write the row to the new csv file.
    REM > and >> redirect output from the console to a text file;
    REM >newfile.csv would overwrite the file on every pass of the
    REM loop, so >>newfile.csv is used to append instead.
    echo !col1!, !col2!, !col3! >>newfile.csv
)
:: open the resulting csv file in the default application
start newfile.csv