Text column issues while converting xlsx to csv

I am having trouble converting an xlsx file to csv format. Somehow the contents of the columns that contain text are not copied over.
I tried: python xlsx2csv-0.20/xlsx2csv.py -s 2 -d ';' 'testin.xlsx' 'testout.csv'
The result should look like:
"www.vistaheads.com";"http://www.vistaheads.com/forums/microsoft-public-windows-vista-general/200274-vista-mbr-vs-xp-mbr-4.html";"YahooBossAPIv2";;"eng";"ie";;9/8/2010;TRUE;FALSE;;0;-8.2666666667;0;0;0;0
"www.drpletsch.com";"http://www.drpletsch.com/elos-acne-treatment.html";"Oxyme.Searchv3.0.0";;"eng";;;7/31/2012;TRUE;FALSE;;;;0;0;0;0
"www.charterhouse-aquatics.co.uk";"http://www.charterhouse-aquatics.co.uk/catalog/elos-systemmini-marine-litre-aquarium-black-p-7022.html";"YahooBossAPIv2";;"eng";"us";;7/11/2012;TRUE;FALSE;;1;5.6666666667;0;0;0;0
"www.proz.com";"http://www.proz.com/kudoz/latin_to_english/religion/4794760-concio_melos_tinnulo.html";"YahooBossAPIv2";;"eng";"in";;5/7/2012;TRUE;FALSE;;1;3;0;0;0;0
"schoee.blogspot.co.uk";"http://schoee.blogspot.co.uk/2010/08/review-body-shop-vitamin-c-facial.html";"YahooBossAPIv2";;"eng";;;8/1/2010;TRUE;FALSE;;1;1;0;0;0;0
But instead I get:
;;;;;;;09-08-10;TRUE;FALSE;;0.0;-8.266666666666666;0.0;0.0;0.0;0.0;
;;;;;;;07-31-12;TRUE;FALSE;;;;0.0;0.0;0.0;0.0;
;;;;;;;07-11-12;TRUE;FALSE;;1.0;5.666666666666667;0.0;0.0;0.0;0.0;
;;;;;;;05-07-12;TRUE;FALSE;;1.0;3.0;0.0;0.0;0.0;0.0;
;;;;;;;08-01-10;TRUE;FALSE;;1.0;1.0;0.0;0.0;0.0;0.0;
;;;;;;;09-08-10;TRUE;FALSE;;0.0;0.033333333333333354;0.0;0.0;0.0;0.0;
;;;;;;;07-03-12;TRUE;FALSE;;1.0;2.0;0.0;0.0;0.0;0.0;
;;;;;;;10-18-11;TRUE;FALSE;;1.0;4.666666666666667;0.0;0.0;0.0;0.0;
I also tried ssconvert, but I get similar results there, i.e.:
ssconvert -S 'testin.xlsx' testout2.csv
Here too, the textual contents have somehow vanished:
2010/09/08,TRUE,FALSE,,0,-8.26666666666667,0,0,0,0
"2012/07/31 09:58:39.823",TRUE,FALSE,,,,0,0,0,0
"2012/07/11 13:35:09.220",TRUE,FALSE,,1,5.66666666666667,0,0,0,0
2012/05/07,TRUE,FALSE,,1,3,0,0,0,0
2010/08/01,TRUE,FALSE,,1,1,0,0,0,0
2010/09/08,TRUE,FALSE,,0,0.03333333333333,0,0,0,0
"2012/07/03 22:24:03.467",TRUE,FALSE,,1,2,0,0,0,0
2011/10/18,TRUE,FALSE,,1,4.66666666666667,0,0,0,0
"2012/07/22 02:10:58.313",TRUE,FALSE,,1,2,0,0,0,0
"2012/08/02 17:01:39.637",TRUE,FALSE,,1,1,0,0,0,0
2010/06/05,TRUE,FALSE,,1,4,0,0,0,0
"2012/07/25 16:11:47.843",TRUE,FALSE,,1,2,0,0,0,0
2012/09/26,TRUE,TRUE,1,,,1,0,0,1
2012/04/29,TRUE,TRUE,2,,,8,3,1,4
2012/07/22,TRUE,FALSE,,0,0.03333333333333,0,0,0,0
2012/05/01,TRUE,FALSE,,1,14,0,0,0,0
"2012/08/07 06:17:39.647",TRUE,FALSE,,1,1,0,0,0,0
"2012/07/18 15:15:19.283",TRUE,FALSE,,1,3,0,0,0,0
2012/07/27,TRUE,FALSE,,1,0.33333333333333,0,0,0,0
2010/09/08,TRUE,FALSE,,1,0.33333333333333,0,0,0,0
"2012/07/21 18:10:57.700",TRUE,FALSE,,1,0.33333333333333,0,0,0,0
The Excel file looks fine to me. Any ideas what could be going wrong?
The Excel file is generated using Apache POI; maybe that's a clue?
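In case it helps to narrow things down: if POI wrote the text cells as inline strings rather than shared strings, some converters may skip them. As a fallback I could dump the sheet myself with openpyxl (which reads both) and the csv module. A minimal sketch, matching the -s 2 sheet selection above (date and number formatting will differ from Excel's display):

import csv
from openpyxl import load_workbook

# Read the second worksheet (same sheet as "-s 2" above), with cell
# values rather than formulas.
wb = load_workbook("testin.xlsx", read_only=True, data_only=True)
ws = wb.worksheets[1]

with open("testout.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    for row in ws.iter_rows(values_only=True):
        writer.writerow(["" if v is None else v for v in row])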
Kind regards,
Rianne

Related

How do I preserve the leading 0 of a number using Unoconv when converting from a .csv file to a .xls file?

I have a 3-column csv file. The 2nd column contains numbers with a leading zero. For example:
044934343
I need to convert a .csv file into a .xls and to do that I'm using the command line tool called 'unoconv'.
It's converting as expected; however, when I load up the .xls in Excel, instead of showing '044934343' the cell shows '44934343' (the leading 0 has been removed).
I have tried surrounding the number in the .csv file with a single quote and a double quote, but the leading 0 is still removed after conversion.
Is there a way to tell unoconv that a particular column should be of a TEXT type? I've tried to read the man page of unoconv, but the options are a little confusing.
Any help would be greatly appreciated.
Perhaps I came too late to the scene, but in case someone is looking for an answer to a similar question, this is how to do it:
unoconv -i FilterOptions=44,34,76,1,1/1/2/2/3/1 --format xls <csvFileName>
The key here is the "1/1/2/2/3/1" part, which is a list of column-number/format-code pairs: it tells unoconv that the second column's type should be "TEXT" (format code 2), leaving the first and third as "Standard" (format code 1).
You can find more info here: https://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options#Token_7.2C_csv_import
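As far as I can tell from that page, the leading tokens in the command are the field separator (44 = comma), the text delimiter (34 = double quote), the character set (76 = UTF-8) and the first line to import (1). Building the column-format token by hand gets fiddly for wide files, so here is a small hypothetical helper, just to illustrate the encoding:

def column_format_token(formats):
    """Build the OpenOffice/unoconv column-format token.

    formats maps a 1-based column number to a format code
    (1 = Standard, 2 = Text).
    """
    return "/".join(f"{col}/{fmt}" for col, fmt in sorted(formats.items()))

# Second column as Text, first and third left as Standard:
token = column_format_token({1: 1, 2: 2, 3: 1})  # -> "1/1/2/2/3/1"
print(f"unoconv -i FilterOptions=44,34,76,1,{token} --format xls file.csv")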
BTW this is my first post here...

convert CSV to JSON using Python

I need to convert a CSV file to a JSON file using Python. I used this,
variable = csv.DictReader(open("file.csv"))
It throws this ERROR
csv.Error: line contains NULL byte
I checked the CSV file in Excel and it shows no NUL characters, but when I printed the data from the CSV file using Python, there was data like SOHNULNULHG (here the last 2 letters, HG, are the data displayed in Excel). I need to remove these control characters from the CSV file while converting to JSON (i.e. I need only the HG from the string above).
I just ran into the same issue. I converted my csv file to CSV UTF-8 and ran it again without any errors. That seemed to fix the control-character issue.
To convert the csv type, I just opened my file in Excel, did Save As, then selected CSV UTF-8 (Comma delimited) (*.csv) as the Save as type.
Hope that helps.
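If re-saving by hand isn't an option, the same fix can be done in code: NUL bytes interleaved with the visible characters often mean the file is really UTF-16, so decoding with the right codec makes them disappear. A minimal sketch, assuming a UTF-16 input with a header row (file names are placeholders):

import csv
import json

# NUL bytes between visible characters usually indicate UTF-16 text;
# decoding with the correct codec removes them.
with open("file.csv", encoding="utf-16") as f:
    rows = list(csv.DictReader(f))

with open("file.json", "w", encoding="utf-8") as out:
    json.dump(rows, out, indent=2)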

split big json files into small pieces without breaking the format

I'm using spark.read() to read a big json file on Databricks, and it failed due to "spark driver has stopped unexpectedly and is restarting" after a long time of running. I assumed it is because the file is too big, so I decided to split it, using the command:
split -b 100m -a 1 test.json
This actually split my file into small pieces and I can now read them on Databricks. But then I found that what I got is a set of null values. I think that is because I split the file only by size, and some of the resulting files are not in valid json format. For example, I might get something like this at the end of a file:
{"id":aefae3,......
Then it can't be read by spark.read.format("json"). So is there any way I can separate the json file into small pieces without breaking the json format?
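One hedged suggestion: if the file is line-delimited JSON (one object per line, which is the layout Spark's json reader handles best for large inputs), then splitting on whole lines instead of byte counts, the same idea as split -l, keeps every record intact. A minimal sketch, with the chunk size as an assumed parameter:

# Split a line-delimited JSON file into chunks of whole records,
# so no record is ever cut in half (assumes one JSON object per line).
CHUNK = 100_000  # records per output file (tune to taste)

with open("test.json") as src:
    out, part = None, 0
    for i, line in enumerate(src):
        if i % CHUNK == 0:
            if out:
                out.close()
            out = open(f"test_part{part}.json", "w")
            part += 1
        out.write(line)
    if out:
        out.close()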

Search and Replace Text in CSV file using Python

I just started with Python 3.4.2 and am trying to find and replace text in a csv file.
In detail, the Input.csv file contains lines like these:
0,0,0,13,.\New_Path-1.1.12\Impl\Appli\Library\Module_RM\Code\src\Exception.cpp
0,0,0,98,.\Old_Path-1.1.12\Impl\Appli\Library\Prof_bus\Code\src\Wrapper.cpp
0,0,0,26,.\New_Path-1.1.12\Impl\Support\Custom\Vital\Code\src\Interface.cpp
0,0,0,114,.\Old_Path-1.1.12\Impl\Support\Custom\Cust\Code\src\Config.cpp
I keep the strings to be searched for in another file named List.csv:
Module_RM
Prof_bus
Vital
Cust
Now I need to go through each line of Input.csv and replace the last column with the matched string.
So my end result should look like this:
0,0,0,13,Module_RM
0,0,0,98,Prof_bus
0,0,0,26,Vital
0,0,0,114,Cust
I read each line of the input file as a list, so the text I need to replace comes in line[4]. I read each module name from the List.csv file and check whether there is any match for it in line[4], but I am not able to make that if condition true. Please let me know if it is not a proper search.
import csv
import re
with open("D:\\My_Python\\New_Python_Test\\Input.csv") as source, open("D:\\My_Python\\New_Python_Test\\List.csv") as module_names, open("D:\\My_Python\\New_Python_Test\\Final_File.csv","w",newline="") as result:
    reader=csv.reader(source)
    module=csv.reader(module_names)
    writer=csv.writer(result)
    #lines=source.readlines()
    for line in reader:
        for mod in module_names:
            if any([mod in s for s in line]):
                line.replace(reader[4],mod)
                print ("YES")
                writer.writerow("OUT")
            print (mod)
        module_names.seek(0)
    lines=reader
Please guide me to complete this task.
Thanks for your support!
At last I succeeded in solving this problem!
The code below works well:
import csv
with open("D:\\My_Python\\New_Python_Test\\Input.csv") as source, open("D:\\My_Python\\New_Python_Test\\List.csv") as module_names, open("D:\\My_Python\\New_Python_Test\\Final_File.csv","w",newline="") as result:
    reader=csv.reader(source)
    module=csv.reader(module_names)
    writer=csv.writer(result)
    flag=False
    for row in reader:
        i=row[4]                      # the path column to be replaced
        for s in module_names:
            k=s.strip()
            if i.find(k)!=-1 and flag==False:
                row[4]=k              # replace the path with the matched module name
                writer.writerow(row)
                flag=True             # only take the first match per row
        module_names.seek(0)          # rewind List.csv for the next row
        flag=False
Thanks to everyone who tried to solve it! If you have any better coding practices, please do share!
Good luck!
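Since better practices were invited: here is a slightly more idiomatic sketch of the same logic (same file paths assumed). It reads the module names once up front instead of rewinding List.csv for every row, and, unlike the version above, it also writes rows with no match through unchanged:

import csv

# Read the search strings once instead of re-scanning the file per row.
with open("D:\\My_Python\\New_Python_Test\\List.csv") as module_names:
    modules = [line.strip() for line in module_names if line.strip()]

with open("D:\\My_Python\\New_Python_Test\\Input.csv") as source, \
     open("D:\\My_Python\\New_Python_Test\\Final_File.csv", "w", newline="") as result:
    writer = csv.writer(result)
    for row in csv.reader(source):
        # Replace the path column with the first module name it contains.
        for module in modules:
            if module in row[4]:
                row[4] = module
                break
        writer.writerow(row)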

Change .xls to .csv without losing columns

I have a folder with around 400 .txt files that I need to convert to .csv. When I batch rename them to .csv, all the columns get smushed together into one. The same thing happens when I convert to .xls and then to .csv, even though the columns are fine in .xls. If I open an .xls file and Save As to .csv, it's fine, but that would require opening all 400 files.
I am working with sed from the Mac terminal. After navigating to the folder that contains the files, here is some code that did not work:
for file in *.csv; do sed 's/[[:blank:]]+/,/g'
for file in *.csv; do sed -e "s/ /,/g"
for file in *.csv; do s/[[:space:]]/,/g
for file in *.csv; do sed 's/[[:space:]]{1,}/,/g'
Any advice on how to restore the column structure to the csv files would be much appreciated. And it's probably already apparent but I'm a coding newb so please go easy. Thanks!
Edit: here is an example of how the xls columns look, and how they should look in csv format:
Dotsc.exe 2/12/15 1:17 PM 0 Nothing 1 Practice
Everything that is separated by spaces here (except the space between the 7 and PM) is in a separate column in the file. Here is what it looks like when I batch rename the file to .csv:
Dotsc.exe 2/12/15 1:17 PM 0 Nothing 1 Practice
The columns have now turned into spaces, and all the data is in one column. Hope that clarifies things.
I don't think what you are trying to do is possible in batch alone. I suggest you use a library in Java.
Take a look here: http://poi.apache.org/spreadsheet/
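For what it's worth, if the original .txt files are tab-delimited (likely, given that the columns survive in .xls), plain Python can also do the batch conversion without opening anything in Excel. A minimal sketch, assuming tab delimiters and that the files sit in the current folder:

import csv
import glob

# Convert every tab-delimited .txt file in the folder to a real .csv,
# quoting as needed so embedded spaces (like "1:17 PM") stay in one column.
for path in glob.glob("*.txt"):
    with open(path, newline="") as src, \
         open(path[:-4] + ".csv", "w", newline="") as dst:
        csv.writer(dst).writerows(csv.reader(src, delimiter="\t"))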