I have a number of CSV files with data in the first three columns only. I want to copy the data from each CSV file and paste it into one single CSV file in column order: data from the first CSV file goes into columns 1, 2, and 3 of the output file; data from the second CSV goes into columns 4, 5, and 6 of the same output file, and so on. Any help would be highly appreciated. Thanks.
I have tried the following code, but it writes the output into the same three columns only.
import glob
import pandas as pd
import time
import numpy as np

start = time.time()
Filename = 'Combined_Data.csv'
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

for i in range(len(all_filenames)):
    data = pd.read_csv(all_filenames[i], skiprows=23)
    data = data.rename({'G1': 'CH1', 'G2': 'CH2', 'Dis': 'CH3'}, axis=1)
    data = data[['CH1', 'CH2', 'CH3']]
    data = data.apply(pd.to_numeric, errors='coerce')
    print(all_filenames[i])
    if i == 0:
        data.to_csv(Filename, sep=',', index=False, header=True, mode='a')
    else:
        data.to_csv(Filename, sep=',', index=False, header=False, mode='a')

end = time.time()
print((end - start), 'Seconds(Execution Time)')
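A minimal pandas sketch of the column-wise combine being asked for, reusing the question's preprocessing: pd.concat(..., axis=1) lays each file's three columns side by side instead of stacking rows. (A sketch only; make sure the output file is written where the input glob will not pick it up on a re-run.)

import glob
import pandas as pd

frames = []
for name in sorted(glob.glob('*.csv')):
    data = pd.read_csv(name, skiprows=23)
    data = data.rename({'G1': 'CH1', 'G2': 'CH2', 'Dis': 'CH3'}, axis=1)
    data = data[['CH1', 'CH2', 'CH3']].apply(pd.to_numeric, errors='coerce')
    frames.append(data.reset_index(drop=True))

combined = pd.concat(frames, axis=1)  # side by side, not stacked
combined.to_csv('Combined_Data.csv', index=False)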
If you don't need to write your own code for this, I'd recommend GoCSV's zip command; it can also handle the CSVs having different numbers of rows.
I have three CSV files:
file1.csv
Dig1,Dig2,Dig3
1,2,3
4,5,6
7,8,9
file2.csv
Letter1,Letter2,Letter3
a,b,c
d,e,f
and file3.csv
RomNum1,RomNum2,RomNum3
I,II,III
When I run gocsv zip file2.csv file1.csv file3.csv I get:
Letter1,Letter2,Letter3,Dig1,Dig2,Dig3,RomNum1,RomNum2,RomNum3
a,b,c,1,2,3,I,II,III
d,e,f,4,5,6,,,
,,,7,8,9,,,
GoCSV is pre-built for a number of different OSes.
Here's how to do it with Python's csv module, using the same three files as above.
The more-memory-intensive option
This accumulates the final CSV one file at a time, expanding a list that represents the final CSV with each new input CSV.
#!/usr/bin/env python3
import csv
import sys

csv_files = [
    'file2.csv',
    'file1.csv',
    'file3.csv',
]

all = []

for csv_file in csv_files:
    with open(csv_file) as f:
        reader = csv.reader(f)
        rows = list(reader)

    len_all = len(all)

    # First file: initialize all and continue (skip)
    if len_all == 0:
        all = rows
        continue

    # The number of columns in all so far
    len_cols = len(all[0])

    # Extend all with the new rows
    for i, row in enumerate(rows):
        # Make sure all has as many rows as this file
        if i >= len_all:
            all.append([''] * len_cols)
        all[i].extend(row)

# Finally, pad all rows on the right
len_cols = len(all[0])
for i in range(len(all)):
    len_row = len(all[i])
    if len_row < len_cols:
        col_diff = len_cols - len_row
        all[i].extend([''] * col_diff)

writer = csv.writer(sys.stdout)
writer.writerows(all)
The streaming option
This reads and writes one row at a time. (It is basically a Python port of the Go code behind GoCSV's zip command, above.)
import csv
import sys

fnames = [
    'file2.csv',
    'file1.csv',
    'file3.csv',
]

num_files = len(fnames)
readers = [csv.reader(open(x)) for x in fnames]

# Collect "header" lines; each header defines the number
# of columns for its file
headers = []
num_cols = 0
offsets = [0]
for reader in readers:
    header = next(reader)
    headers.append(header)
    num_cols += len(header)
    offsets.append(num_cols)

writer = csv.writer(sys.stdout)

# With all headers counted, every row must have this many columns
shell_row = [''] * num_cols

for i, header in enumerate(headers):
    start = offsets[i]
    end = offsets[i + 1]
    shell_row[start:end] = header

# Write the headers
writer.writerow(shell_row)

# Expect that not all CSVs have the same number of rows;
# some will "finish" ahead of others
file_is_complete = [False] * num_files
num_complete = 0

# Loop a row at a time...
while True:
    # ... for each CSV
    for i, reader in enumerate(readers):
        if file_is_complete[i]:
            continue
        start = offsets[i]
        end = offsets[i + 1]
        try:
            row = next(reader)
            # Put this row in its place in the main row
            shell_row[start:end] = row
        except StopIteration:
            file_is_complete[i] = True
            num_complete += 1

    if num_complete == num_files:
        break

    # Done iterating the CSVs (for this row), write it
    writer.writerow(shell_row)

    # Reset for the next main row
    shell_row = [''] * num_cols
For either, I get:
Letter1,Letter2,Letter3,Dig1,Dig2,Dig3,RomNum1,RomNum2,RomNum3
a,b,c,1,2,3,I,II,III
d,e,f,4,5,6,,,
,,,7,8,9,,,
Related
I have a .txt file with 683,500 rows; every 7 rows is a different person, containing:
ID
Name
Work position
Date 1 (year - month)
Date 2 (year - month)
Gross payment
Service time
I would like to read that .txt file and output every person as one 7-column row (the output could be JSON, CSV, TXT, or even a database), for example:
ID Name Work position Date 1 Date 2 Gross payment Service time
ID Name Work position Date 1 Date 2 Gross payment Service time
ID Name Work position Date 1 Date 2 Gross payment Service time
ID Name Work position Date 1 Date 2 Gross payment Service time
Example in the txt:
00000000886
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-08
2021-09
30,556.04
15.7
00000000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30,556.04
15.7
00100000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30,556.04
15.7
import csv

# opening file
file = open(r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt")  # open file
counter = 0
total_lines = len(file.readlines())  # count lines
# print('Total lines:', total_lines)

# reading from file
content = file.read()
colist = content.split()
print(colist)

# read data from data1.txt and write it to data2.txt
lines = open(r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt")
arr = []
with open('data2.txt', 'w') as f:
    for line in lines:
        # arr.append(line)
        f.write(line)
I'm new to programming and I don't know how to translate my logic to code.
Your code does not collect multiple lines to write them into one.
Use this approach:
read your file line by line
collect each line, without its \n, into a list
when the list reaches length 7, write it to the CSV and clear the list
repeat until done
Create data file:
with open ("t.txt","w") as f:
f.write("""00000000886\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-08\n2021-09\n30,556.04\n15.7
00000000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7
00100000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7""")
Program:
import csv

with open("t.csv", "w", newline="") as wr, open("t.txt") as r:
    # create a csv writer
    writer = csv.writer(wr)

    # uncomment if you want a header over your data
    # h = ["ID", "Name", "Work position", "Date 1", "Date 2",
    #      "Gross payment", "Service time"]
    # writer.writerow(h)

    person = []
    for line in r:  # could use enumerate as well, this works ok
        # collect the line's data, minus the \n, into the list
        person.append(line.strip())

        # this person is finished: write, clear list
        if len(person) == 7:
            # leveraged the csv module writer, look it up if you need
            # to customize it further regarding quoting etc.
            writer.writerow(person)
            person = []  # reset list for next person

    # something went wrong, your file is inconsistent, write remainder
    if person:
        writer.writerow(person)

print(open("t.csv").read())
Output:
00000000886,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-08,2021-09,"30,556.04",15.7
00000000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
00100000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
Read up: csv module - writer
The "Gross payment" needs to be quoted because it contain s a ',' wich is the delimiter for csv - the module does this automagically.
On top of the excellent answer from @PatrickArtner, I would like to propose an itertools-based solution:
import csv
import itertools


def file_grouper_itertools(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    with open(in_filepath, 'r') as in_file, \
            open(out_filepath, 'w') as out_file:
        writer = csv.writer(out_file)
        args = [iter(in_file)] * size
        for block in itertools.zip_longest(*args, fillvalue=' '):
            # equivalent, for the given input, to:
            # block = [x.rstrip('\n') for x in block]
            block = ''.join(block).rstrip('\n').split('\n')
            writer.writerow(block)
The idea there is to loop in blocks of the required size. For larger group sizes this gets faster simply because the main loop executes fewer cycles.
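The [iter(in_file)] * size grouping idiom can look cryptic at first; all size slots share one iterator, so zip_longest() pulls size consecutive items per block. A quick illustration with plain numbers:

import itertools

nums = iter(range(10))
args = [nums] * 3  # three references to the same iterator
print(list(itertools.zip_longest(*args, fillvalue=None)))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]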
Running some micro-benchmarking shows that your use case should benefit from this approach compared to the manual looping (adapted into a function):
import csv


def file_grouper_manual(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    with open(in_filepath, 'r') as in_file, \
            open(out_filepath, 'w') as out_file:
        writer = csv.writer(out_file)
        block = []
        for line in in_file:
            block.append(line.rstrip('\n'))
            if len(block) == size:
                writer.writerow(block)
                block = []
        if block:
            writer.writerow(block)
Benchmarking:
n = 100_000
k = 7
with open ("t.txt", "w") as f:
for i in range(n):
f.write("\n".join(["0123456"] * k))
%timeit file_grouper_manual()
# 1 loop, best of 5: 325 ms per loop
%timeit file_grouper_itertools()
# 1 loop, best of 5: 230 ms per loop
Alternatively, you could use Pandas, which is very convenient, but requires that all the input fit into available memory (which should not be a problem in your case, but can be for larger inputs):
import numpy as np
import pandas as pd


def file_grouper_pandas(in_filepath="t.txt", out_filepath="t.csv", size=7):
    with open(in_filepath) as in_file:
        data = [x.rstrip('\n') for x in in_file.readlines()]
    df = pd.DataFrame(
        np.array(data).reshape((-1, size)),
        columns=list(range(size)))
    # consistent with the other solutions
    df.to_csv(out_filepath, header=False, index=False)
%timeit file_grouper_pandas()
# 1 loop, best of 5: 666 ms per loop
If you do a lot of work with tables and data, NumPy and Pandas are really useful libraries to get comfortable with.
import numpy as np
import pandas as pd

columns = ['ID', 'Name', 'Work position', 'Date 1 (year - month)',
           'Date 2 (year - month)', 'Gross payment', 'Service time']

with open('oldfile.txt', 'r') as stream:
    # read the file into a list of lines
    lines = stream.readlines()

# remove the newline character from each element of the list
lines = [line.strip('\n') for line in lines]

# figure out how many rows there will be in the table
number_of_people = len(lines) // 7

# split the data into rows
data = np.array_split(lines, number_of_people)

# convert the data to a pandas dataframe
df = pd.DataFrame(data, columns=columns)
Once you have converted the data to a pandas DataFrame, you can easily output it to any of the formats you listed. For example, to output to CSV you can do:
df.to_csv('newfile.csv')
Or for JSON it would be:
df.to_json('newfile.json')
Hi, I have this huge 14 GB CSV file with entries that span multiple lines, and I would like an easy way to split it. By the way, the split command will not work, because it is not aware of how many columns there are in a row and will cut it wrong.
Using XSV (https://github.com/BurntSushi/xsv) is very simple:
xsv split -s 10000 ./outputdir inputFile.csv
-s 10000 to set the number of records to write into each chunk.
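If you would rather stay in Python, here is a minimal sketch of the same chunked split using the csv module, which is quote-aware and so will not cut a multi-line row in half (the file names and chunk size are illustrative):

import csv

def split_csv(in_path, out_prefix, size=10000):
    # Split in_path into files of at most `size` records, repeating the header
    with open(in_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        out_file, writer, n = None, None, 0
        for i, row in enumerate(reader):
            if i % size == 0:  # start a new chunk
                if out_file:
                    out_file.close()
                out_file = open('{}-{}.csv'.format(out_prefix, n), 'w', newline='')
                writer = csv.writer(out_file)
                writer.writerow(header)
                n += 1
            writer.writerow(row)
        if out_file:
            out_file.close()

split_csv('inputFile.csv', 'chunk', size=10000)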
import os
import pandas as pd

data_root = r"/home/glauber/Projetos/nlp/"
fname = r"blogset-br.csv.gz"
this_file = os.path.join(data_root, fname)
assert os.path.exists(this_file), this_file

column_names = ['postid', 'blogid', 'published', 'title', 'content',
                'authorid', 'author_displayName', 'replies_totalItems', 'tags']
parse_dates = ['published']

df_iterator = pd.read_csv(this_file,
                          skiprows=0,
                          compression='gzip',
                          chunksize=1000000,
                          header=None,
                          names=column_names,
                          parse_dates=parse_dates,
                          index_col=1)

count = 0
for df in df_iterator:
    filename = 'blogset-br-' + str(count) + '.csv'
    df.to_csv(filename)
    count += 1
This is the easiest way I could find.
I am trying to convert the CSV files in a folder to a single JSON file. The code below does the job, but the issue is that the JSON file has the first CSV written into it several times. I guess I am going wrong with assigning the data variable. Help me fix it.
import csv, json, os

dir_path = 'C:/Users/USER/Desktop/output_files'
inputfiles = [file for file in os.listdir(dir_path) if file.endswith('.csv')]
outputfile = "data_backup1.json"

for file in inputfiles:
    filepath = os.path.join(dir_path, file)
    data = {}
    with open(filepath, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            id = row['ID']
            data[id] = row
            with open(outputfile, "a") as jsonfile:
                jsonfile.write(json.dumps(data, indent=4))
Expected output: the JSON file needs to have each CSV written into it only once.
If your .csv files and all of their rows do have different ['ID']s, your assigned dictionary keys should be unique. In this case, your dictionary grows by one entry per .csv row read. You have to change the indentation of the jsonfile.write() call as shown below, so each CSV is written into the .json file only once. To sort your entries you could add sort_keys=True in this function.
for file in inputfiles:
    filepath = os.path.join(dir_path, file)
    data = {}
    with open(filepath, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            id = row['ID']
            data[id] = row
    with open(outputfile, "a") as jsonfile:
        jsonfile.write(json.dumps(data, indent=4, sort_keys=True))
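Alternatively, a minimal sketch that accumulates one top-level dict across all the files and writes it exactly once with json.dump(), so re-runs cannot append duplicate JSON objects (it makes the same assumption that the ['ID']s are unique):

import csv, json, os

dir_path = 'C:/Users/USER/Desktop/output_files'
inputfiles = [f for f in os.listdir(dir_path) if f.endswith('.csv')]

data = {}
for file in inputfiles:
    with open(os.path.join(dir_path, file), "r") as csvfile:
        for row in csv.DictReader(csvfile):
            data[row['ID']] = row  # unique IDs assumed

# "w" (not "a") so the file holds exactly one JSON object
with open("data_backup1.json", "w") as jsonfile:
    json.dump(data, jsonfile, indent=4, sort_keys=True)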
I want to open multiple csv files (with the same data types/columns), save the data into one variable, do some stuff to the data, and save it into one csv file. While I can easily open one file, I can't seem to find a way to open multiple files. Here is my code:
import numpy as np
import csv
from collections import Counter

files = ['11.csv', '12.csv', '13.csv', '14.csv', '15.csv']

with open(files) as csvfile:
    info = csv.reader(csvfile, delimiter=',')
    info_types = []
    records = 0
    for row in info:
        records = row[2]
        call_types.append(records)

stats = Counter(call_types).most_common()
print(stats)
results = stats

resultFile = open("Totals.csv", 'w')
wr = csv.writer(resultFile, dialect='excel')
for output in results:
    wr.writerow(output)
To make it work, and to make it simultaneously less bug-prone and more efficient, try the following.
import csv

files = ['11.csv', '12.csv', '13.csv', '14.csv', '15.csv']

with open("outfile", "wt") as fw:
    writer = csv.writer(fw)
    for file in files:
        with open(file) as csvfile:
            info = csv.reader(csvfile, delimiter=',')
            info_types = []
            records = 0
            for row in info:
                # process row, but don't store it in any list if you
                # don't have to (that would defeat the purpose);
                # say you get processed_row
                writer.writerow(processed_row)
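For instance, if "processing" simply meant keeping the third column (a hypothetical stand-in for your real logic), that inner loop could read:

for row in info:
    processed_row = [row[2]]  # hypothetical: keep only the third column
    writer.writerow(processed_row)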
I would do this within a loop, since you are already appending the data as you are reading from the file.
for f in files:
    with open(f) as csvfile:
        ...
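Filled in along those lines, a sketch of the whole thing, assuming (as in the question) that you are tallying the values of the third column across all the files:

import csv
from collections import Counter

files = ['11.csv', '12.csv', '13.csv', '14.csv', '15.csv']

call_types = []
for f in files:
    with open(f, newline='') as csvfile:
        for row in csv.reader(csvfile, delimiter=','):
            call_types.append(row[2])  # third column

stats = Counter(call_types).most_common()
print(stats)

with open("Totals.csv", 'w', newline='') as resultFile:
    wr = csv.writer(resultFile, dialect='excel')
    wr.writerows(stats)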
I want to import the following CSV file data into Aerospike and fire a simple select query to display the data, using Python as a client, e.g.:
policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
query = client.query( 'test', 'csvfile' )
query.select( 'policyID', 'statecode' )
You could try to use the Python csv module along with the Aerospike Python client:
https://docs.python.org/2/library/csv.html
http://www.aerospike.com/docs/client/python/
And do something similar to the following:
import aerospike
import sys
import csv

# connect a client first (the hosts here are an assumption; adjust to your cluster)
config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

rec = {}

csvfile = open('aerospike.csv', "rb")
reader = csv.reader(csvfile)
rownum = 0
for row in reader:
    # Save the first row, which has the headers
    if rownum == 0:
        header = row
    else:
        colnum = 0
        for col in row:
            # print(rownum, header[colnum], col)
            rec[header[colnum]] = col
            colnum += 1
    rownum += 1
    # print(rownum, rec)
    if rec:
        client.put(('test', 'demo', str(rownum)), rec)
        rec = {}

csvfile.close()
Note: You may need to check the size of your headers and make sure they do not exceed 14 characters.
If not, you could get the following error:
error: (21L, 'A bin name should not exceed 14 characters limit', 'src/main/conversions.c', 500)
As far as I am aware, there is no pre-built tool other than the loader that allows you to import CSV. You could, perhaps, build one using the existing client tools.
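To then run the select query from the question, something like this should work, assuming a connected client and the 'test'/'demo' namespace and set used in the loader sketch above:

def print_result(record):
    key, metadata, bins = record
    print(bins.get('policyID'), bins.get('statecode'))

query = client.query('test', 'demo')
query.select('policyID', 'statecode')
query.foreach(print_result)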