How to save a dataframe to CSV in PySpark

I am trying to save a dataframe to HDFS.
It gets saved as part-0000 and is split into multiple part files.
I want to save it as an Excel sheet, or at least as a single part file.
How can we achieve this?
Code used so far:
df1.write.csv('/user/gtree/tree.csv')

Your dataframe is saved according to its partitions (multiple partitions = multiple files). You can coalesce the dataframe down to 1 partition so that only one part file is written.
Link: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.coalesce
df1.coalesce(1).write.csv('/user/gtree/tree.csv')

You can also use .repartition(1) to reduce the dataframe to a single partition before writing:
df.repartition(1).write.csv(filePath)
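With either coalesce(1) or repartition(1), Spark still treats the given path as a directory and writes one part-* file (plus marker files such as _SUCCESS) inside it. The sketch below shows one way to pull that single part file out under a plain file name. It assumes the output directory sits on a local or mounted filesystem that Python can reach (for HDFS itself you would use the HDFS tooling instead); the function name and the paths are hypothetical.

```python
import glob
import os
import shutil


def promote_single_part_file(output_dir, target_path):
    """Move the only part-* file in a Spark output directory to target_path.

    Hypothetical helper: assumes output_dir was written after
    coalesce(1)/repartition(1) and lives on a local or mounted filesystem.
    """
    # Marker files such as _SUCCESS and hidden .crc files do not match "part-*".
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {output_dir}, found {len(part_files)}"
        )
    shutil.move(part_files[0], target_path)
    return target_path


# Example call (hypothetical paths):
# promote_single_part_file('/tmp/tree_csv', '/tmp/tree.csv')
```

The helper refuses to guess when there are zero or several part files, which is what you would see if the dataframe had not been reduced to one partition first.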

Related

How to add a row in a CSV file in pentaho data integration

I need to add a row data in a CSV file using Pentaho Data Integration.
I've tried with this transformation
This is my CSV file input configuration
and this is the CSV file output configuration (with the "append" check activated ...)
My constant definition
and this is my CSV file sample
I'd like to have this
Any suggestion will be appreciated!
You can use the Data grid step to create your constant data and the Append streams step to merge the two streams into one in your desired order (the data types in the two streams must match and the fields must be in the same order). Then you can write the data to a CSV file. If you don't need a header in the CSV file, you can uncheck the "Header" option in the Content tab.

Importing specific columns from a CSV into excel

I am trying to do what the title says, and also do it for new records. I cannot link the CSV file because it exceeds the 255 limit, so I am attempting to split up the table.
I have the below table in Access:
DateOfTest | Time | PromptTime | TestSequence | PATResults | Logs | Serial Number
1          | 2    | 3          | 4            | 5          | 6    | 7
Obviously, where the numbers are, I want the data from the CSV to be inserted.
I have created a form with a button so I can run some VBA, but I cannot find the right information online for my case; as I am new to VBA, it is also a bit confusing.
I have attempted some random code, but I was just spraying and praying at that point.
I am not sure I understood your question. In the import tool you can choose columns, but if you want to do it with a script, I would suggest a pre-processing step in Python with pandas: read the CSV file, remove any unwanted columns, and save the result to a new file that can be opened directly in Excel.
Something like this:
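import pandas as pd
# Read the source CSV into a dataframe
df = pd.read_csv('csvfile.csv')
# Drop the column(s) you do not want to keep
df = df.drop(columns=['column_name'])
# Write an Excel file (needs an Excel writer engine such as openpyxl)
df.to_excel('filename.xlsx', index=False, header=True)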
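If the goal is to keep a known set of columns rather than drop named ones, the same pre-processing step could also be sketched with Python's standard csv module, with no extra packages. This is only an illustration: the function name, the file names and the column list in the example call are placeholders, not taken from the actual file.

```python
import csv


def keep_columns(src_path, dst_path, wanted):
    """Copy src_path to dst_path, keeping only the columns named in wanted."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" silently drops every column not listed in wanted;
        # a wanted column missing from the input is written as an empty value.
        writer = csv.DictWriter(dst, fieldnames=wanted, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)


# Example call (placeholder names):
# keep_columns('csvfile.csv', 'subset.csv', ['DateOfTest', 'Time', 'Serial Number'])
```

The smaller CSV produced this way can then be imported or linked in place of the original wide file.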

How to write on second sheet of CSV file using TDI/SDI?

I want to write some data on second sheet of a CSV file using FileConnector in IBM TDI/SDI.
The first sheet of the same file has data which should not be overwritten.
Is it possible to do so?
Any lead will be appreciated! Thank you
CSV files do not have 'sheets'.
They are files with tabular data having only one structure for the whole file, resulting in a single table.

I want to write from 2nd row while writing DataFrame to csv file using Apache Spark (Scala API)

While writing DataFrame to csv file using something like:
df.write.format("com.databricks.spark.csv").option("header", "true").save("file.csv")
It always writes starting from the first row, but I want to start writing from the second row. How can I do that?
You can perform the following steps to achieve that:
Get the first row object using df.first().
Filter the original dataframe to exclude that row using the filter method. Note that a filter on the row's values also removes any later rows that are identical to it.
Save the filtered dataframe to CSV using your existing code.
Hope this helps!

How to Get Data from CSV File and Send them to Excel Using Pentaho?

I have a tabular csv file that has seven columns and containing the following data:
ID,Gender,PatientPrefix,PatientFirstName,PatientLastName,PatientSuffix,PatientPrefName
2 ,M ,Mr ,Lawrence ,Harry , ,Larry
I am new to Pentaho and I want to design a transformation that moves the data (the values of the 7 columns) to an empty Excel sheet. The Excel sheet has different column names but should carry the same data, as shown:
prefix_name,first_name,middle_name,last_name,maiden_name,suffix_name,Gender,ID
I tried to design a transformation using the following series of steps, but it gives me errors at the end that I could not interpret.
What is the proper design to move the data from the csv file to the excel sheet in this case? Any ideas to solve this problem?
As #Brian.D.Myers mentioned in the comment, you can use the Select values step. Here is a step-by-step explanation of how to do it:
Select all the fields from CSV file input step.
Configure the select values step as follows.
In the Content tab of Excel writer step click on Get fields button and fill the fields. Alternatively you can use Excel output step as well.