How to use delimiter="," for cleaning the CSV file data - csv

I am new to Python.
I have one zipped folder. First I need to extract the CSV files into a new folder, and then read all the data and merge it into one new file.
I successfully unzipped the folder and was able to read the data from all the files.
However, I am stuck on the data cleaning operations.
I am getting all the data in column A and need to split it by using delimiter=","
Can someone advise on that? Please.
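A minimal sketch of one way to do the split-and-merge with pandas (the folder and file names here are assumptions, not from the question). Passing sep="," tells pandas where to break each line into columns, so the values no longer land in a single column A:

import glob
import pandas as pd

frames = []
for path in glob.glob("extracted/*.csv"):
    # sep="," splits each line into separate columns instead of one block in column A
    frames.append(pd.read_csv(path, sep=","))

# stack all files into one DataFrame and write the merged result
merged = pd.concat(frames, ignore_index=True)
merged.to_csv("merged.csv", index=False)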

Related

Azure Data Factory - Unzip single file with multiple csv files being copied to different destinations

We are ingesting a zip file from blob storage with Azure Data Factory. Within this zipped file there are multiple csv files which need to be copied to Azure SQL tables.
For example, let's say zipped.zip contains Clients.csv, Accounts.csv, and Users.csv. Each of these CSV files will need to be copied to a different destination table: dbo.Client, dbo.Account, and dbo.User.
It seems that I would need to use a Copy Activity to unzip and copy these files to a staging location, and from there copy those individual files to their respective tables.
Just want to confirm that there is not a different way to unzip and copy all in one action, without using a staging location? Thanks!
Your thoughts are correct: there is no direct way to do this without a staging location, as long as you are not writing any custom code logic and are leveraging just the Copy activity.
A somewhat similar thread: https://learn.microsoft.com/en-us/answers/questions/989436/unzip-only-csv-files-from-my-zip-file.html
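For completeness, a hedged sketch of the custom-code route mentioned above, which skips the staging copy by streaming the zip in Python. The connection strings, container name, and table mapping are all assumptions:

import csv
import io
import zipfile

import pyodbc
from azure.storage.blob import BlobClient

# download the zipped blob into memory (names are assumptions)
blob = BlobClient.from_connection_string(
    conn_str='<storage connection string>',
    container_name='ingest', blob_name='zipped.zip')
archive = zipfile.ZipFile(io.BytesIO(blob.download_blob().readall()))

conn = pyodbc.connect('<Azure SQL ODBC connection string>')
cur = conn.cursor()
tables = {'Clients.csv': 'dbo.Client',
          'Accounts.csv': 'dbo.Account',
          'Users.csv': 'dbo.User'}
for name, table in tables.items():
    with archive.open(name) as f:
        reader = csv.reader(io.TextIOWrapper(f))
        header = next(reader)  # assume each file has a header row
        placeholders = ','.join('?' * len(header))
        for row in reader:
            cur.execute(f'INSERT INTO {table} VALUES ({placeholders})', row)
conn.commit()

For large files a bulk-load path would be faster; this only illustrates the one-step alternative to a staging location.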

Copying multiple files from one folder to another in the same S3 bucket

I am trying to copy files from one folder to another. However, the source folder has multiple folders in it, each containing multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each file has only 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have the following:
folder a - inside this folder, 10000 JSON files
folder b - inside this folder, 10000 JSON files
My aim - target_folder - dev-bucket/final/, containing all 20000 JSON files.
I tried writing the code below; however, the processing time is huge. Is there any other way to approach this?
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('dev-bucket')
source_folder = 'data/'
file_count = 0

try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        file_count = file_count + 1
        # build the destination key under final/ from the source file name
        new_obj = bucket.Object('final/' + obj.key.split('/')[-1])
        new_obj.copy(old_source)
except Exception as e:
    print("The process has failed to copy files from sftp location to base location", e)
    exit(1)
I was thinking of merging the data into one single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the code below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)
response = []
for page in pages:
    for obj in page['Contents']:
        key = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=key)
        text = result["Body"].read().decode()
        # list.append returns None, so don't reassign the list
        response.append(text)
Can you please guide me? Many thanks in advance.
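To finish the merge step described above, one possible continuation (a sketch, reusing s3_client, s3_bucket, and the response list from the snippet; the output key is an assumption):

# joining with newlines produces a JSON Lines file;
# "final/merged.json" is an assumed output key
merged = "\n".join(response)
s3_client.put_object(Bucket=s3_bucket, Key="final/merged.json",
                     Body=merged.encode())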
If you only need to copy once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that they contain JSON data and that you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS (a sketch follows below)
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
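A minimal sketch of driving that CREATE TABLE AS step from Python with boto3. The database, table names, and S3 locations are assumptions; the Glue crawler is presumed to have created the raw_json table:

import boto3

athena = boto3.client('athena')

# all names below are assumptions; the crawler created merged_db.raw_json
ctas = """
CREATE TABLE merged_parquet
WITH (format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      external_location = 's3://dev-bucket/final-parquet/') AS
SELECT * FROM raw_json
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'merged_db'},
    ResultConfiguration={'OutputLocation': 's3://dev-bucket/athena-results/'},
)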
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.
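A tiny sketch of that Firehose idea with boto3 (the delivery stream name is an assumption); Firehose buffers the records and writes them to S3 as fewer, larger objects:

import boto3

firehose = boto3.client('firehose')
# 'merge-stream' is an assumed delivery stream name
firehose.put_record(
    DeliveryStreamName='merge-stream',
    Record={'Data': b'{"id": 1}\n'},
)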

Automatically updating a MySQL database when a new csv file is saved inside a folder

I am trying to create some sort of event or script that detects when a new CSV file is saved into a folder, and then inserts the content of this CSV file into a MySQL server.
Is this possible? If it is, how can I approach this problem, and what materials should I study to make this happen?
I am using .NET to create applications right now and created a CSV importer using OpenFileDialog(). However, my boss wants me to import a CSV automatically whenever a new file is saved.
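One common approach is a filesystem watcher that fires on file creation. A hedged sketch in Python using the watchdog and mysql-connector-python packages (the asker works in .NET, where FileSystemWatcher plays the same role; the paths, credentials, and table layout here are all assumptions):

import csv
import time

import mysql.connector
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class CsvHandler(FileSystemEventHandler):
    def on_created(self, event):
        # react only to newly created .csv files, not directories
        if event.is_directory or not event.src_path.endswith('.csv'):
            return
        conn = mysql.connector.connect(host='localhost', user='app',
                                       password='secret', database='mydb')
        cur = conn.cursor()
        with open(event.src_path, newline='') as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row (assumed present)
            for row in reader:
                # table and column names are assumptions
                cur.execute('INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)', row)
        conn.commit()
        conn.close()

observer = Observer()
observer.schedule(CsvHandler(), path='/path/to/watched/folder')
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()

In .NET the equivalent building block is System.IO.FileSystemWatcher driving the same insert loop; a scheduled job that scans for unseen files is a simpler alternative if watcher events can be missed.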

JMeter read the second sheet of CSV

How can I make JMeter read the second sheet of my CSV?
I want to use CSV Data Set Config.
Normally, it reads the first line of the first sheet but is there any way to be a bit more flexible?
The CSV file format doesn't have "sheets"; it is a normal plain-text file that uses delimiters to represent structured data.
If you are trying to get data from, for example, a Microsoft Excel file, unfortunately you won't be able to do it using the CSV Data Set Config. The easiest option would be exporting the data as separate plain-text CSV files.
If you don't have the possibility to do the export, you can still access the data in Excel files, but it will be a little trickier, as you will have to use JSR223 Test Elements, the Groovy language, and the Apache POI libraries.
More information:
Busy Developers' Guide to HSSF and XSSF Features
How to Extract Data From Files With JMeter
You can't do this with the CSV Data Set Config alone; you need to add external code, for example using Apache Commons CSV.
Download the jar file and place it in the JMETER_HOME lib folder, and then write the code in a JSR223 element.
For example, code to get the second record:
import java.io.FileReader;
import java.io.Reader;
import java.util.Iterator;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

Reader reader = new FileReader("path/to/file.csv");
// parse() returns an Iterable<CSVRecord>; use its Iterator to step through records
Iterator<CSVRecord> records = CSVFormat.RFC4180.parse(reader).iterator();
records.next(); // skip the first record
CSVRecord secondRecord = records.next();
// String columnOne = secondRecord.get(0);

How to overwrite Excel destination in SSIS?

I have created a package to fetch data from two SQL Server tables, combine the data using a Merge Join, and store the result in an Excel destination.
The first time it works fine. The second time it appends the same data again, so the Excel file contains repeated rows.
How do I overwrite the Excel file rows?
Yes, it's possible!
Here is the solution:
First go to your Excel Destination, click the New button next to the name of the Excel sheet, and copy the CREATE TABLE (DDL) query inside.
Then put an Execute SQL Task into your Control Flow and connect it to the Data Flow that contains the Excel destination. Set the Connection Type to Excel, set the Connection to your Excel destination's Excel Connection Manager, go to SQL Statement and type:
DROP TABLE `put the name of the sheet from the Excel query you just copied`
GO
Finally, paste the CREATE TABLE query after it.
That is all you need to do to solve the problem.
You can refer to this link for complete info:
http://dwhanalytics.wordpress.com/2011/04/07/ssis-dynamically-generate-excel-tablesheet/
Yes, it's possible!
Using SSIS we can solve this problem:
First of all, create an Excel file (its structure defined via the Excel Connection Manager) in one location as a template file. Then create a copy of that Excel file using a File System Task in another location, making sure to set OverwriteDestination=True. Finally, using a Data Flow Task, insert the data into the newly copied file. Whenever we want to insert data, the package first creates a fresh copy of the template Excel file and then loads the data into it.
Unfortunately the Excel connection manager does not have a setting that allows overwriting the data. You'll need to set up some file manipulation using the File System Task in the Control Flow.
There are several possibilities, here's one of them. You can create a template file (which just contains the sheet with the header) and prior to the Data Flow Transformation a File System Task copies it over the previously exported file.
The File System Task (MSDN)
For Excel, SSIS will append data; there is no option available for overwriting.
You have to delete and recreate the file through the File System Task.
Using a CSV file with a Flat File Connection Manager instead would serve your purpose of overwriting.
The best solution for me was using File System Tasks to delete and recreate the Excel files from a template.
What I was trying to do was send every employee a report with an Excel attachment in the same format but with different data. In a Foreach container, for each employee I get the required data, create an Excel file, and send a mail with the Excel file attached.
I first:
Create an Excel template (manually)
Create an original Excel file to be used (manually)
Then, in the Foreach container:
Delete the original file (SSIS File System Task)
Copy the template as the original file (SSIS File System Task)
Get the data from SQL Server and write it to the original file (SSIS Data Flow Task)
Send the mail (SSIS -> SQL Stored Procedure)