Python: copy and paste files based on paths in two CSVs

I have two csv files, one with a list of paths for source files, the second, a list of paths for where to copy the files to. Both files have the same number of elements and each source file is only copied once.
How would I load the .csv files (Pandas? Numpy? csv.reader?), and how would I copy all of the items in the best possible way? I am able to get the following to work if src and dest each refer to one path.
import pandas as pd
srcdf = pd.read_csv('src.csv')
destdf = pd.read_csv('dest.csv')
from shutil import copyfile
copyfile(src,dest)
There are no headers or columns in my files. Each is just a vector of comma-separated values. The values in my src csv file look like:
/Users/johndoe/Downloads/50.jpg,
/Users/johndoe/Downloads/51.jpg,
The values in my dest csv file look like:
/Users/johndoe/Downloads/newFolder/50.jpg,
/Users/johndoe/Downloads/newFolder/51.jpg,

Assuming your CSV is just a list of paths with a single path on each row, you could do something like this:
import csv
from shutil import copyfile
def load_paths(filename):
    # Return the first field of every non-empty row as a list of paths.
    # The default ',' delimiter drops the trailing comma on each line.
    paths = []
    with open(filename, newline='') as csvfile:
        for row in csv.reader(csvfile):
            if row and row[0].strip():
                paths.append(row[0].strip())
    return paths
srcpaths = load_paths('srcfile.csv')
dstpaths = load_paths('dstfile.csv')
# Pair the i-th source with the i-th destination.
for src, dst in zip(srcpaths, dstpaths):
    copyfile(src, dst)

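You can use numpy genfromtxt as follows:
import numpy as np
from shutil import copyfile
# delimiter=',' with usecols=0 keeps only the path and drops the trailing comma;
# dtype=str returns text rather than bytes.
srcdf = np.genfromtxt('./src.csv', dtype=str, delimiter=',', usecols=0)
destdf = np.genfromtxt('./dest.csv', dtype=str, delimiter=',', usecols=0)
assert len(srcdf) == len(destdf)
for n in range(len(srcdf)):
    copyfile(srcdf[n], destdf[n])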
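If the destination folders (such as newFolder in the question) might not exist yet, a standard-library variant along these lines could create them before copying. This is only a sketch: the helper names read_paths and copy_pairs are made up for illustration, and it assumes each row holds one path in its first field.

```python
import csv
import os
import shutil


def read_paths(csv_path):
    """Return the first field of each non-empty row, stripped of whitespace."""
    with open(csv_path, newline="") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]


def copy_pairs(src_csv, dest_csv):
    """Copy the i-th source path to the i-th destination path."""
    sources = read_paths(src_csv)
    destinations = read_paths(dest_csv)
    if len(sources) != len(destinations):
        raise ValueError("src and dest lists differ in length")
    for src, dest in zip(sources, destinations):
        # Create the destination folder if it does not exist yet.
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        shutil.copyfile(src, dest)
    return len(sources)
```

Calling copy_pairs('src.csv', 'dest.csv') would then return the number of files copied, and raise ValueError if the two lists do not line up.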

Related

Azure Bicep - load excel file in Bicep

I would like to load the values from an Excel file. It contains only names, and there are a lot of them, so I don't want to copy them all into an array by hand. I want some solution if it's possible like [loadJsonContent].
I want some solution if it's possible like [loadJsonContent].
If you want a built-in file function for this in Bicep, the answer is no.
According to the official documentation, at the time of writing Bicep has only three file functions:
loadFileAsBase64
Loads the file as a base64 string.
loadJsonContent
Loads the specified JSON file as an Any object.
loadTextContent
Loads the content of the specified file as a string.
I think you will need to write code to meet your requirement.
By the way, you didn't say whether the Excel file is xlsx or csv. If possible, please provide a sample of the file so that we can provide specific code.
For example, suppose I have a Student.xlsx file (the CSV file has the same structure) whose sheet 'Tabelle1' has a 'Student Name' column.
Then I can use this Python code to parse it and get the data I want:
import csv
import openpyxl
# Get the student names from column 'Student Name' of sheet 'Tabelle1' in
# 'XLSX_Folder/Student.xlsx', or from the same column of the CSV file.
def get_student_name(file_path, sheet_name, col):
    student_name = []
    if file_path.endswith('.xlsx'):
        wb = openpyxl.load_workbook(file_path)
        sheet = wb[sheet_name]
        # Row 1 is the header, so the values start at row 2.
        for i in range(2, sheet.max_row + 1):
            student_name.append(sheet.cell(row=i, column=col).value)
        print('This is xlsx file.')
    elif file_path.endswith('.csv'):
        # Get all the values under the column, except the header row.
        with open(file_path, 'r', newline='') as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row
            for row in reader:
                student_name.append(row[col - 1])
        print('This is csv file.')
    else:
        print('This is other format file.')
    return student_name
XLSX_file_path = 'XLSX_Folder/Student.xlsx'
CSV_file_path = 'CSV_Folder/Student.csv'
sheet_name = 'Tabelle1'
col = 2
print(get_student_name(XLSX_file_path, sheet_name, col))
print(get_student_name(CSV_file_path, sheet_name, col))
After that, parse your Bicep file and insert the extracted data into it.
The above code is just a demo; you can write your own in whatever language you like. Either way, there is no built-in feature for this requirement.
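One possible bridge, sketched under the assumption that a Python step can run before deployment: write the extracted names to a JSON file, which the loadJsonContent function listed above can then load. The helper name write_names_json and the file name names.json are hypothetical.

```python
import json


def write_names_json(names, out_path):
    """Write the non-empty names to out_path as a JSON array and return them."""
    # Drop empty cells (None or blank strings) and trim stray whitespace.
    cleaned = [str(n).strip() for n in names if n is not None and str(n).strip()]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(cleaned, f, indent=2)
    return cleaned
```

For instance, write_names_json(get_student_name(XLSX_file_path, sheet_name, col), 'names.json') would produce a file that a Bicep template stored next to it could read with loadJsonContent('names.json').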

create a script to classify elements of a CSV file

I have a list of all the airports in the world in a CSV file. Is it possible to write a script that creates a folder for each country (named after the country) and puts all the airports of that country in it, automatically, for every country in the CSV file?
Thanks for your help.
I am assuming that you have a CSV file called input.csv that contains a column named Country.
The following Python script creates a folder for every distinct country in the input file and appends each airport's row to a file called data.csv inside that folder.
import os
import csv
countries = []
with open('input.csv', 'r', newline='') as read_obj:
    csv_dict_reader = csv.DictReader(read_obj)
    for row in csv_dict_reader:
        country = row["Country"]
        # Create the folder the first time a country is seen.
        if country not in countries:
            countries.append(country)
            try:
                os.mkdir(country)
            except FileExistsError:
                print(country, "already exists")
        # Append this airport's row to the country's data.csv (no header row is written).
        with open(os.path.join(country, 'data.csv'), 'a+', newline='') as f:
            writer = csv.DictWriter(f, row.keys())
            writer.writerow(row)
You might also want to look at pandas for another way to achieve this.
The following script reads two different CSV files containing data that reference the same airport codes, creates a folder for each airport code, and saves its data in two different files, one for each input. Change the input and output filenames according to your needs.
import os
import pandas as pd
df1 = pd.read_csv('input.csv')
df2 = pd.read_csv('input1.csv')
for c in df1['code'].unique():
    try:
        os.mkdir(c)
    except FileExistsError:
        print(c, " already exists")
    df1.loc[df1["code"] == c].to_csv(c + '/output1.csv', index=False)
for c in df2['code'].unique():
    try:
        os.mkdir(c)
    except FileExistsError:
        print(c, " already exists")
    df2.loc[df2["code"] == c].to_csv(c + '/output2.csv', index=False)
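As a standard-library alternative, the rows could be grouped in memory first and each country's file written once, with a header row. This is a sketch, not a definitive implementation: the function name split_by_column and the out_root parameter are made up for illustration, and it assumes the whole input fits in memory and that the column values are valid folder names.

```python
import csv
import os
from collections import defaultdict


def split_by_column(input_csv, out_root, column="Country"):
    """Write one <out_root>/<value>/data.csv per distinct value of `column`."""
    groups = defaultdict(list)
    with open(input_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            groups[row[column]].append(row)

    for value, rows in groups.items():
        folder = os.path.join(out_root, value)
        os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
        out_path = os.path.join(folder, "data.csv")
        # Mode "w" overwrites, so re-running does not duplicate rows.
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return sorted(groups)
```

Because each file is opened in write mode rather than append mode, running the script twice leaves the same output instead of doubling every row.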

Extracting csv file rows as individual .txt files

I am new to Python and trying to extract certain data from rows of a csv file into individual .txt files (to create a corpus for NLP). So far I have the following:
import csv
with open(r"file.csv", "r+", encoding='utf-8') as f:
    reader = csv.reader(f)
    data = list(reader)
t = (data[1][91])
fn = str(data[1][90])
g = open("%s.txt" %fn,"w+")
for i in range(1):
    g.write(t)
g.close()
This does what I want for the 1st row; however, I am not sure how to get the program to loop up to row 1047. Note: the [1] signifies row 1; the [91] and [90] should remain fixed.
Thanks in advance!
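One way to extend this to every row, as a minimal sketch: iterate over the rows with enumerate and write one file per row, keeping the two column indices fixed. The function name rows_to_txt and its parameters are hypothetical, and the sketch assumes row 0 is a header and that the values in column 90 are valid file names.

```python
import csv
import os


def rows_to_txt(csv_path, out_dir, name_col=90, text_col=91, first_row=1, last_row=1047):
    """Write column `text_col` of each row to '<out_dir>/<name_col value>.txt'."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for index, row in enumerate(csv.reader(f)):
            if index < first_row:
                continue  # skip rows before first_row (assumed: row 0 is the header)
            if index > last_row:
                break  # stop once the last wanted row has been written
            out_path = os.path.join(out_dir, "%s.txt" % row[name_col])
            with open(out_path, "w", encoding="utf-8") as g:
                g.write(row[text_col])
            written.append(out_path)
    return written
```

Reading row by row avoids holding the whole file in a list, and the with block closes each output file even if a write fails.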

How to add/change column names with pyarrow.read_csv?

I am currently trying to import a big CSV file (50GB+) without any headers into a pyarrow table, with the overall goal of exporting it to the Parquet format and then processing it in a Pandas or Dask DataFrame. How can I specify the column names and column dtypes within pyarrow for the CSV file?
I already thought about prepending a header to the CSV file, but that forces a complete rewrite of the file, which looks like unnecessary overhead. As far as I know, pyarrow provides schemas to define the dtypes for specific columns, but the docs are missing a concrete example of doing so while converting a CSV file to an Arrow table.
As a simple example, imagine that this CSV file has just the two columns "A" and "B".
My current code looks like this:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.csv  # makes pa.csv available
df_with_header = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df_with_header)
df_with_header.to_csv("data.csv", header=False, index=False)
df_without_header = pd.read_csv('data.csv', header=None)
print(df_without_header)
opts = pa.csv.ConvertOptions(column_types={'A': 'int8',
                                           'B': 'int8'})
table = pa.csv.read_csv(input_file = "data.csv", convert_options = opts)
print(table)
If I print out the final table, the column names have not changed:
pyarrow.Table
1: int64
3: int64
How can I now change the loaded column names and dtypes? Is it also possible, for example, to pass in a dict containing the names and their dtypes?
You can specify type overrides for columns:
import io
import pyarrow as pa
from pyarrow import csv
fp = io.BytesIO(b'one,two,three\n1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    convert_options=csv.ConvertOptions(
        column_types={
            'one': pa.int8(),
            'two': pa.int8(),
            'three': pa.int8(),
        }
    ))
But in your case you don't have a header, and as far as I can tell this use case is not supported in arrow:
fp = io.BytesIO(b'1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    parse_options=csv.ParseOptions(header_rows=0)
)
This raises:
pyarrow.lib.ArrowInvalid: header_rows == 0 needs explicit column names
The code is here: https://github.com/apache/arrow/blob/3cf8f355e1268dd8761b99719ab09cc20d372185/cpp/src/arrow/csv/reader.cc#L138
This is similar to the question "apache arrow - reading csv file".
There should be a fix for it in the next version: https://github.com/apache/arrow/pull/4898

How to read in a certain csv file column (up and down) Python

I have a CSV file with data that looks like:
lastname firstname id
segre alberto 14562
I want to read in just the column with the id numbers.
Everything I try keeps giving me the row, not the column, of the CSV file.
import csv
import operator
idgetter = operator.itemgetter(2)
with open('path/to/file') as infile:
    infile.readline()
    ids = [idgetter(row) for row in csv.reader(infile)]
You could use Pandas.
import pandas as pd
col = pd.read_csv('/path/to/file')['id']
And if you want it as a list, simply list(col) will do the trick.
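Without pandas, csv.DictReader from the standard library can also select a column by its header name. A minimal sketch, assuming the first line of the file is the header; the function name read_column is hypothetical, and the delimiter parameter is there because the sample data in the question may be separated by spaces rather than commas.

```python
import csv


def read_column(csv_path, column="id", delimiter=","):
    """Return every value in the named column, top to bottom, as strings."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        return [row[column] for row in reader]
```

For a space-separated file, read_column('path/to/file', delimiter=' ') would be the call to try.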