Parsing csv file with vim - csv

I have a large CSV file structured as follows:
CHINESE TRANSLATION
我去上学。 Wǒ qù shàngxué. I am going to school. 上 ♦ on, on top of ♦ go to
我去过北京。 Wǒ qùguò Běijīng. I've been to Beijing. 京 -- ♦ national capital ♦ Beijing
....
The TRANSLATION column blends together three different pieces of information: the pinyin, the English translation, and additional information. These three types of information are always present, always presented in the same way, and separated by a dot.
What I want to achieve is to create three different columns from the TRANSLATION column, i.e. to get:
CHINESE PINYIN TRANSLATION ADDITIONAL
我去上学。 Wǒ qù shàngxué. I am going to school. 上 ♦ on, on top of ♦ go to
....
How can I do this using a vim macro?

I think vim macros can handle this job, but executing a vim macro several thousand times on a big file is very slow. So if you just want the job done, I have written a Python script that I think will give you what you want.
import csv

# Change 'in.csv' and 'out.csv' to your exact file names.
with open('in.csv', 'r') as infile:
    with open('out.csv', 'w') as outfile:
        csvreader = csv.reader(infile)
        for a, b in csvreader:
            # Split the blended column on dots and re-join with commas.
            line = a + ',' + ','.join(part.strip() for part in b.split('.'))
            outfile.write(line + '\n')
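Note that joining the pieces with bare commas produces a malformed CSV if any piece itself contains a comma, as the additional-information column in the sample does ("on, on top of"). A variant that lets the csv module handle quoting, assuming the three parts are separated by a dot followed by a space (an assumption worth checking against your data; split_translation is a hypothetical helper):

```python
import csv
import io

def split_translation(row):
    """Split the blended TRANSLATION field into pinyin, English
    translation, and additional information.

    Assumes the three parts are separated by '. ' (dot plus space) and
    that the additional information contains no '. ' sequence.
    """
    chinese, blended = row
    pinyin, english, extra = blended.split(". ", 2)
    return [chinese, pinyin + ".", english + ".", extra]

# Round-trip one sample row through the csv module so embedded commas
# (like "on, on top of") are quoted correctly in the output.
infile = io.StringIO('我去上学。,"Wǒ qù shàngxué. I am going to school. 上 ♦ on, on top of ♦ go to"\n')
outfile = io.StringIO()
writer = csv.writer(outfile)
for row in csv.reader(infile):
    writer.writerow(split_translation(row))
print(outfile.getvalue())
```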

Related

Load a CSV onto Apache Beam where there is a comma in some of the fields

I am loading a CSV into Apache Beam, but the CSV I am loading has commas in the fields. It looks like this:
ID, Name
1, Barack Obama
2, Barry, Bonds
How can I go about fixing this issue?
This is not specific to Beam, but a general problem with CSV. It's unclear if the second line should have ID="2, Barry" Name="Bonds" or the other way around.
If you can use some context (e.g. the ID is always an integer, or only one field could possibly contain commas), you could solve this by reading the input as a text file line by line and parsing each line into separate fields with a custom DoFn (assuming rows don't contain embedded newlines).
Generally, non-separating commas should be inside quotes in well-formed CSV, which makes this much more tractable (e.g. it would just work with the Beam Dataframes read_csv.)
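As a sketch of the context-based idea (assuming the ID is always an integer and only the name field can contain commas), the per-line parsing that a custom DoFn would delegate to might look like this; parse_ambiguous_line is a hypothetical helper, not a Beam API:

```python
import re

def parse_ambiguous_line(line):
    """Split one line into (id, name), assuming the ID is always an
    integer, so everything after the first comma belongs to the name."""
    m = re.match(r"\s*(\d+)\s*,\s*(.*?)\s*$", line)
    if m is None:
        raise ValueError(f"unparseable line: {line!r}")
    return int(m.group(1)), m.group(2)

print(parse_ambiguous_line("2, Barry, Bonds"))  # (2, 'Barry, Bonds')
```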

How can I write certain sections of text from different lines to multiple lines?

So I'm currently trying to use Python to transform a large amount of data from a .txt file into a neat and tidy .csv file. The first stage is getting the 8-digit company numbers into one column called 'Company numbers'. I've created the header and just need to put each company number from each line into the column. What I want to know is: how do I tell my script to read the first eight characters of each line in the .txt file (which correspond to the company number) and then write them to the .csv file? This is probably very simple, but I'm new to Python!
So far, I have something which looks like this:
with open(r'C:/Users/test1.txt') as rf:
    with open(r'C:/Users/test2.csv', 'w', newline='') as wf:
        outputDictWriter = csv.DictWriter(wf, ['Company number'])
        outputDictWriter.writeheader()
        rf = rf.read(8)
        for line in rf:
            wf.write(line)
My recommendation would be to 1) read the file in, 2) make the relevant transformation, and then 3) write the results to file. I don't have sample data, so I can't verify whether my solution exactly addresses your case.
with open('input.txt', 'r') as file_handle:
    file_content = file_handle.read()

list_of_IDs = []
for line in file_content.split('\n'):
    print("line =", line)
    print("first 8 =", line[0:8])
    list_of_IDs.append(line[0:8])

with open("output.csv", "w") as file_handle:
    file_handle.write("Company number\n")
    for line in list_of_IDs:
        file_handle.write(line + "\n")
The value of separating these steps is to enable debugging.
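If you'd rather keep the csv.DictWriter approach from your attempt, the same three steps can be sketched with a small helper (numbers_to_csv is a hypothetical name) that works on any iterable of lines, which also makes it easy to test without touching real files:

```python
import csv
import io

def numbers_to_csv(lines, outfile):
    """Write the first 8 characters of each non-blank line into a
    single 'Company number' column."""
    writer = csv.DictWriter(outfile, ["Company number"])
    writer.writeheader()
    for line in lines:
        if line.strip():
            writer.writerow({"Company number": line[:8]})

# Works the same whether `lines` comes from a real file or, as here,
# an in-memory buffer used for testing.
buf = io.StringIO()
numbers_to_csv(["12345678 ACME LTD\n", "87654321 FOO PLC\n"], buf)
print(buf.getvalue())
```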

line feed within a column in csv

I have a CSV like the one below. Some of the columns contain line breaks, like column B. When I run wc -l file.csv, Unix returns 4, but these are actually 3 records. I don't want to replace the line breaks with spaces; I am going to load the data into a database using SQL*Loader and want to load it as is. What should I do so that each record, line break included, is counted as one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like Perl and Python have CSV parsing libraries available, there are packages like csvkit that provide useful utilities, and more.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
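The same count can be reproduced with Python's built-in csv module, which treats a quoted line break as part of the field rather than as a record boundary:

```python
import csv
import io

# The sample file: 4 physical lines, but only 3 CSV records.
data = 'A,B,C,D\n1,"hello\nworld",sds,sds\n2,sdsd,sdds,sdds\n'
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))    # 3 records, even though the text has 4 lines
print(rows[1][1])   # the embedded newline survives inside the field
```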

How to convert these text segments to CSV?

I'm trying to make a CSV file out of a text file of questions and answers marked up with HTML tags.
Unfortunately, I've tried and wasn't able to open the result in LibreOffice or any other CSV-compatible software.
What I'm trying to do is to convert something like this:
<b>Question 1:</b> What is the capital of United States?
<b>Answer:</b> Washington, D.C.
<b>Question 2:</b> What is the capital of United States?
<b>Answer:</b> Washington, D.C.
<b>Question 3:</b> What is the capital of United States?
<b>Answer:</b> Washington, D.C.
And so on.
The result should be:
Question SEPARATOR Answer
*SEPARATOR cannot be a colon or semicolon because the question might contain those characters (colons, semicolons, dots)
I want to import into Anki and it supports CSV files.
I've tried separating the Question and Answer with a special symbol like #, but then only the questions are parsed in LibreOffice/OpenOffice, and the question text can never contain a line break; if the text contains a line break, the CSV gets messed up.
Here's a little Python script to convert your cards to CSV format:
import re
import csv

questions = [[]]
with open("original.file", "rt") as f:
    for line in f:
        line = line.strip()
        if line:
            questions[-1].append(line)
        else:
            questions.append([])

# Drop any empty groups produced by consecutive or trailing blank lines.
questions = [q for q in questions if q]

# Now we've got the questions in a 2D list.
# Let's just make sure it loaded properly.
assert all(len(question) == 2 for question in questions)

# Let's write a CSV file now.
with open("output.csv", "wt", newline="") as f:
    writer = csv.writer(f)
    for q, a in questions:
        writer.writerow([
            re.fullmatch(r"<b>Question \d+:</b> (.*)", q).group(1),
            re.fullmatch(r"<b>Answer:</b> (.*)", a).group(1),
        ])
Now you can import these with the "Basic" card type. This code discards the question number; I hope that wasn't too important.

What does 'Multiline strings are different' mean in RIDE (Robot Framework) output?

I am trying to compare the data of two CSV files, following the process below in RIDE -
${csvA} = Get File ${filePathA}
${csvB} = Get File ${filePathB}
Should Be Equal As Strings ${csvA} ${csvB}
Here are my two csv contents -
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
As some of the data does not match, when I run the code in RIDE the result is FAIL, and the log shows the data below -
Multiline strings are different:
--- first
+++ second
@@ -1,4 +1,4 @@
 Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa
I would like to know the meaning of the --- first, +++ second, and @@ -1,4 +1,4 @@ content.
Thanks in advance!
When robot compares multiline strings (data that has newlines in it), it shows the differences in unified diff format, the same format produced by the standard unix diff tool. Even though you pass in raw data, it treats the data as two files and shows the differences between the two in a format familiar to most programmers.
Here are two references to read more about the format:
What does "@@ -1 +1 @@" mean in Git's diff output? (stackoverflow)
the diff man page (gnu.org)
In short, the @@ line gives you a reference for which line numbers are different, and the + and - prefixes show you which lines differ.
In your specific example it's telling you that three lines were different between the two strings: the line beginning with Divy, the line beginning with Parth, and the line beginning with acc. Since the line beginning with Harshil does not show a + or -, that means it was identical between the two strings.
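You can reproduce that log output with Python's difflib, which emits the same unified-diff format:

```python
import difflib

first = ["Harshil,45,8.03,DMJ", "Divy,55,8,VVN",
         "Parth,1,9,vvn", "kjhjmb,44,0.5,bugg"]
second = ["Harshil,45,8.03,DMJ", "Divy,55,78,VVN",
          "Parth,1,9,vvnbcb", "acc,5,6,afafa"]

# lineterm="" because our lines carry no trailing newlines.
diff = list(difflib.unified_diff(first, second,
                                 fromfile="first", tofile="second",
                                 lineterm=""))
print("\n".join(diff))
```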