Formatting a CSV file with multiple paraphrased sentences for a given input sentence - csv

I am trying to create a dataset for a T5 model, and I want to include multiple paraphrased sentences for a given input sentence in my CSV file.
For example, the input sentence is "The cat sat on the mat." and I have two different paraphrased versions of this sentence:
The cat was resting on the rug
The cat was seated on the rug
I want to include both of these versions as the target sentences for the input sentence in my CSV file.
My current CSV file looks something like this:
Input | Target
The cat sat on the mat | The cat was resting on the rug
I'm going to store | I'm heading to the store
Now where can I place the other versions of paraphrased sentences in the above dataset?
Here's my code:
import pandas as pd
from sklearn.model_selection import train_test_split
from simplet5 import SimpleT5

# instantiate
model = SimpleT5()
# load (supports t5, mt5, byT5 models)
model.from_pretrained("t5", "google/flan-t5-base")
path = "dataset.csv"
df = pd.read_csv(path)
train_df, test_df = train_test_split(df, test_size=0.2)
model.train(train_df=train_df,
            eval_df=test_df,
            source_max_token_len=128,
            target_max_token_len=50,
            batch_size=8, max_epochs=3, use_gpu=False)
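One common way to store several targets for one input is to repeat the input on its own row for each paraphrase, so that every row remains a single (input, target) pair. The following is only a minimal sketch of that layout, not a confirmed answer to the question: the helper name expand_paraphrases and the in-memory sample data are made up for illustration, and the Input/Target headers are copied from the question (as far as I know simpleT5 expects the columns to be called source_text and target_text, so rename them if needed).

```python
import csv
import io


def expand_paraphrases(paraphrases):
    """Turn {input: [target, ...]} into one (input, target) row per paraphrase."""
    rows = []
    for source, targets in paraphrases.items():
        for target in targets:
            rows.append((source, target))
    return rows


# hypothetical sample data, taken from the sentences in the question
paraphrases = {
    "The cat sat on the mat": [
        "The cat was resting on the rug",
        "The cat was seated on the rug",
    ],
    "I'm going to store": ["I'm heading to the store"],
}

rows = expand_paraphrases(paraphrases)

# write to an in-memory buffer here; for a real file use
# open("dataset.csv", "w", newline="") instead of io.StringIO()
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Input", "Target"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With this layout the input sentence simply appears twice, once per paraphrase. If you split into train and test sets afterwards, you may want to split by input sentence so the same input does not land in both sets.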

Related

How can I write certain sections of text from different lines to multiple lines?

So I'm currently trying to use Python to transform large sums of data into a neat and tidy .csv file from a .txt file. The first stage is trying to get the 8-digit company numbers into one column called 'Company numbers'. I've created the header and just need to put each company number from each line into the column. What I want to know is, how do I tell my script to read the first eight characters of each line in the .txt file (which correspond to the company number) and then write them to the .csv file? This is probably very simple but I'm only new to Python!
So far, I have something which looks like this:
with open(r'C:/Users/test1.txt') as rf:
    with open(r'C:/Users/test2.csv','w',newline='') as wf:
        outputDictWriter = csv.DictWriter(wf,['Company number'])
        outputDictWriter.writeheader()
        rf = rf.read(8)
        for line in rf:
            wf.write(line)
My recommendation would be to 1) read the file in, 2) make the relevant transformation, and then 3) write the results to a file. I don't have sample data, so I can't verify whether my solution exactly addresses your case.
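# 1) read the file in
with open('input.txt', 'r') as file_handle:
    file_content = file_handle.read()

# 2) keep the first 8 characters of each non-empty line
list_of_IDs = []
for line in file_content.split('\n'):
    if not line:
        continue  # skip blank lines, e.g. after a trailing newline
    print("line = ", line)
    print("first 8 =", line[0:8])
    list_of_IDs.append(line[0:8])

# 3) write the results to a file
with open("output.csv", "w") as file_handle:
    file_handle.write("Company\n")
    for line in list_of_IDs:
        file_handle.write(line + "\n")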
The value of separating these steps is to enable debugging.
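The same three steps can also be written with the csv module, which the question already uses. This is a rough sketch under assumptions: the helper name company_numbers and the two sample lines are invented for illustration, and in-memory buffers stand in for the real test1.txt and test2.csv files.

```python
import csv
import io


def company_numbers(lines, width=8):
    """Return the first `width` characters of every non-blank line."""
    return [line[:width] for line in lines if line.strip()]


# hypothetical sample input; real data would come from open('input.txt')
sample = io.StringIO("12345678 ACME LTD\n87654321 EXAMPLE PLC\n\n")
ids = company_numbers(sample.read().splitlines())

# for a real file use open('output.csv', 'w', newline='') instead
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Company number"])
writer.writerows([number] for number in ids)
result = out.getvalue()
print(result)
```

Keeping the slicing in its own function makes it easy to check the extracted numbers before anything is written to disk.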

Zapier Code Step Model Data into CSV

I'm looking for help with some JavaScript to insert inside of a code step in Zapier. I have two inputs that are named/look like the following:
RIDS: 991,992,993
LineIDs: 1,2,3
Each of these should match in the quantity of items in the list. There can be 1, 2 or 100 of them. The order is significant.
What I'm looking for is a code step to model the data into one CSV matching up the positions of each. So using the above data, my output would look like this:
991,1
992,2
993,3
Does anyone have code or easily know how to achieve this? I am not a JavaScript developer.
Zapier doesn't allow you to create files in a code step. You can, though, use the code step to generate text which can then be used in another step. I used Python for my example (I'm not as familiar with JavaScript, but the strategy is the same).
Create CSV file in Zapier from Raw Data
Code Step with LineIDs and RIDs as inputs
import csv
import io
# Convert inputs into lists
lids = input_data['LineIDs'].split(',')
rids = input_data['RIDs'].split(',')
# Create file-like CSV object
csvfile = io.StringIO()
filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
# Write CSV rows
filewriter.writerow(['LineID', 'RID'])
for x in range(len(lids)):
    filewriter.writerow([lids[x], rids[x]])
# Get CSV object value as text and set to output
output = {'text': csvfile.getvalue()}
Use a Google Drive step to Create File from Text
File Content = Text from Step 1
Convert to Document = no
This will create a *.txt document
Use a CloudConvert step to Convert File from txt to csv.
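If the goal is exactly the output shown in the question (RID first, no header row), the pairing itself can be sketched as a small pure function. This is only an illustration: pair_ids is a made-up helper name, and the input_data keys mentioned in the comment are assumptions that must match whatever names you gave the inputs in your code step.

```python
def pair_ids(rids, line_ids):
    """Pair two comma-separated strings position by position, one 'rid,line_id' per line."""
    left = [item.strip() for item in rids.split(",")]
    right = [item.strip() for item in line_ids.split(",")]
    if len(left) != len(right):
        raise ValueError("RIDS and LineIDs must contain the same number of items")
    return "\n".join(f"{rid},{line_id}" for rid, line_id in zip(left, right))


text = pair_ids("991,992,993", "1,2,3")
print(text)

# In a Zapier Python code step this would become something like
# (assumed key names):
# output = {'text': pair_ids(input_data['RIDS'], input_data['LineIDs'])}
```

The explicit length check is a design choice: zip would otherwise silently drop the extra items when the two lists differ in length.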

Editing a .csv file with a batch file [closed]

This is my first question on here. I work as a meteorologist and have some coding experience, though it is far from professionally taught. Basically, what I have is a .csv file from a weather station that is giving me data that is too detailed (65.66 degrees and similar values). What I want to do is automate, via a script file, a way to access the .csv file and get rid of values that are too detailed: take a temperature from 65.66 to 66 (rounding up for anything above .5 and down for anything below), or take a pressure of 29.8889 and make it 29.89 using the same rounding rules. Is this possible? If so, how should I go about it? Again, keep in mind that my coding skills for batch files are not the strongest.
Any help would be much appreciated.
Thanks,
I agree with the comments above. Math in batch is limited to integers, and won't work well for the manipulations you want.
I'd use PowerShell. Besides easily handling floating point math, it also has built-in methods for objectifying CSV data (as well as XML and other types of structured data). Take the following hypothetical CSV data contained within weather.csv:
date,time,temp,pressure,wx
20160525,12:30,65.66,30.1288,GHCND:US1TNWS0001
20160525,13:00,67.42,30.3942,GHCND:US1TNWS0001
20160525,13:30,68.92,31.0187,GHCND:US1TNWS0001
20160525,14:00,70.23,30.4523,GHCND:US1TNWS0001
20160525,14:30,70.85,29.8889,GHCND:US1TNWS0001
20160525,15:00,69.87,28.7384,GHCND:US1TNWS0001
The first thing you want to do is import that data as an object (using import-csv), then round the numbers as desired -- temp rounded to a whole number, and pressure rounded to a precision of 2 decimal places. Rounding to a whole number is easy. Just recast the data as an integer. It'll be rounded automatically. (One caveat: by default both the [int] cast and [math]::round() round exact halves to the nearest even number, so 66.5 becomes 66 rather than 67.) Rounding the pressure column is pretty easy as well if you invoke the .NET [math]::round() method.
# grab CSV data as a hierarchical object
$csv = import-csv weather.csv
# for each row of the CSV data...
$csv | foreach-object {
    # recast the "temp" property as an integer
    $_.temp = [int]$_.temp
    # round the "pressure" property to a precision of 2 decimal places
    $_.pressure = [math]::round($_.pressure, 2)
}
Now pretend you want to display the temperature, barometric pressure, and weather station name where "date" = 20160525 and "time" = 14:30.
$row = $csv | where-object { ($_.date -eq 20160525) -and ($_.time -eq "14:30") }
$row | select-object pressure,temp,wx | format-table
Assuming "pressure" started with a value of 29.8889 and "temp" had a value of 70.85, then the output would be:
pressure temp wx
-------- ---- --
29.89 71 GHCND:US1TNWS0001
If the CSV data had had multiple rows with the same date and time values (perhaps measurements from different weather stations), then the table would display with multiple rows.
And if you wanted to export that to a new csv file, just replace the format-table cmdlet with export-csv outfile.csv:
$row | select-object pressure,temp,wx | export-csv outfile.csv
Handy as a pocket on a shirt, right?
Now, pretend you want to display the human-readable station names rather than NOAA's designations. Make a hash table.
$stations = @{
    "GHCND:US1TNWS0001" = "GRAY 1.5 E TN US"
    "GHCND:US1TNWS0003" = "GRAY 1.9 SSE TN US"
    "GHCND:US1TNWS0016" = "GRAY 1.3 S TN US"
    "GHCND:US1TNWS0018" = "JOHNSON CITY 5.9 NW TN US"
}
Now you can add a "station" property to your "row" object.
$row = $row | select *,"station"
$row.station = $stations[$row.wx]
And now if you do this:
$row | select-object pressure,temp,station | format-table
Your console shows this:
pressure temp station
-------- ---- -------
29.89 71 GRAY 1.5 E TN US
For extra credit, say you want to export this row data to JSON (for a web page or something). That's slightly more complicated, but not impossibly so.
add-type -AssemblyName System.Web.Extensions
$JSON = new-object Web.Script.Serialization.JavaScriptSerializer
# convert $row from a PSCustomObject to a more generic hash table
$obj = @{}
# the % sign in the next line is shorthand for "foreach-object"
$row.psobject.properties | %{
    $obj[$_.Name] = $_.Value
}
# Now, stringify the row and display the result
$JSON.Serialize($obj)
The output of that should be similar to this:
{"station":"GRAY 1.5 E TN US","wx":"GHCND:US1TNWS0001","temp":71,"date":"20160525","pressure":29.89,"time":"14:30"}
... and you can redirect it to a .json file by using > or pipe it into the out-file cmdlet.
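The question asks for exact halves to round up, whereas many default rounding functions round halves to the nearest even number instead. As a language-neutral illustration of the difference, here is a small Python sketch using the standard decimal module. The helper name round_half_up is made up, and the sample values are copied from the hypothetical weather.csv above.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, places=0):
    """Round a numeric string; exact halves round away from zero."""
    quantum = Decimal(10) ** -places  # 1, 0.1, 0.01, ...
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


# (temp, pressure) pairs taken from the sample CSV rows
samples = [("65.66", "30.1288"), ("70.85", "29.8889")]
rounded = [(str(round_half_up(t)), str(round_half_up(p, 2))) for t, p in samples]
print(rounded)

# contrast on an exact half: Python's built-in round() goes to the even neighbour
print(round(66.5), round_half_up("66.5"))
```

Passing the values as strings keeps them exact, so a reading such as 66.5 is not distorted by binary floating point before it is rounded.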
DOS batch scripting is, by far, not the best place to edit text files. However, it is possible. I will include sample, incomplete DOS batch code at the bottom of this post to demonstrate the point. I recommend you focus on Excel (no coding needed) or Python.
Excel - You don't need to code at all with Excel. Open the csv file. Let's say you have 66.667 in cell B12. In cell C12 enter a formula using the round function (code below). You can also teach yourself some Visual Basic for Applications. But, for this simple task, that is overkill. When done, if you save as csv format, you will lose your formulae and only have data. Consider saving as xlsx or xlsm.
Visual Basic Script - you can run vbscript on your machine with cscript.exe (or wscript.exe), which is part of Windows. But, if using VB script, you might as well use VBA in Excel. It is almost identical.
Python is a very high level language with built-in libraries that make editing a csv file super easy. I recommend Anaconda (a Python suite) from continuum.io. But, you can find the generic Python at python.org as well. Anaconda will come prepackaged with lots of helpful libraries. For csv editing, you will likely want to use the pandas library. You can find plenty of short videos on YouTube.
Excel
Say you have 66.667 in cell B12. Set the formula in C12 to...
"=ROUND(B12,0)" to round to integer
"=ROUND(B12,1)" to round to one decimal place
As you copy and paste, Excel will attempt to intelligently update the formulas for you.
Python
import pandas as pd
# load csv file to memory. Name your columns using names=[]
df = pd.read_csv("C:/temp/weather.csv", names=["city", "temperature", "date"])
# round the temperature column and store the result back in the frame
# (note: exact halves are rounded to the nearest even number)
df["temperature"] = df["temperature"].round()
df.to_csv('newfile.csv', index=False)  # export to a new csv file
df.to_excel('newfile.xlsx', index=False)  # or export to an Excel file instead (needs openpyxl)
DOS Batch
A Batch script for this is much, much harder. I will not write the whole program, because it is not a great solution. But, I'll give you a taste in DOS batch code at the bottom of this post. Compared to using Python or Excel, it is extremely complex.
Here is a rough sketch of DOS code. Because I don't recommend this method, I didn't take the time to debug this code.
@echo off
setlocal ENABLEDELAYEDEXPANSION
rem prep our new file for output. Let's write the header row.
>newfile.csv echo col1,col2,col3
rem read the existing text file line by line
rem since it is csv, we will parse on comma
rem skip lines starting with semi-colon
FOR /F "eol=; tokens=2,3* delims=, " %%i in (input_file.txt) do (
    set "col1=%%i"
    set "col2=%%j"
    set "col3=%%k"
    rem truncate col2 to 1 decimal place
    for /f "tokens=1,2 delims=." %%A in ("!col2!") do (
        set "integer=%%A"
        set "decimal=%%B"
        set "decimal=!decimal:~0,1!"
        rem or, you can use an if statement to round up or down
        rem Now, put the integer and decimal together again and
        rem redefine the value for col2.
        set "col2=!integer!.!decimal!"
        rem write output to a new csv file
        rem a single redirect would overwrite newfile.csv on every pass.
        rem We don't want that, since we are in a loop.
        rem a double redirect appends to newfile.csv, perfect!
        >>newfile.csv echo !col1!,!col2!,!col3!
    )
)
rem open csv file in default application
start "" newfile.csv

Reading XML data into R from a html source

I'd like to import data into R from a given webpage, say this one.
In the source code (but not on the actual page), the data I'd like to get is stored in a single line of javascript code which starts like this:
chart_Line1.setDataXML("<graph rotateNames (stuff omitted) >
<set value='699.99' name='16.02.2013' />
<set value='731.57' name='18.02.2013' />
<set value='more values' name='more dates' />
...
<trendLines> (now a different command starts, stuff omitted)
</trendLines></graph>")
(Note that I've included line breaks for readability; the data is in one single line in the original file. It would suffice to import only the line which starts with chart_Line1.setDataXML - it's line 56 in the source if you want to have a look yourself)
I can read the whole html file into a string using scan("URLofFile", what="raw"), but how do I extract the data from this?
Can I specify the data format with what="...", keeping in mind that there are no line breaks to separate the data, but several line breaks in the irrelevant prefix and suffix?
Is this something which can be done in a nice way using R tools, or do you suggest that this data acquisition should rather be done with a different script?
With some trial & error, I was able to find the exact line where the data is contained. I read the whole html file, and then dispose of all other lines.
require(zoo)
require(stringr)
# get html data, scrap all lines but the interesting one
theurl <- "https://www.magickartenmarkt.de/Black_Lotus_Unlimited.c1p5093.prod"
sec <- scan(file =theurl, what = "character", sep="\n")
sec <- sec[45]
# extract all strings of the form "value='X'", where X is a 1 to 3 digit number with some separator and 2 decimal places
values <- str_extract_all(sec, "value='[0-9]{1,3}.[0-9]{2}'")
# dispose of all non-numerical, non-separator values
values <- str_replace_all(unlist(values),"[^0-9/.]","")
# get all dates in the form "name='DD.MM.YYYY"
dates <- str_extract_all(sec, "name='[0-9]{2}.[0-9]{2}.[0-9]{4}'")
# dispose of all non-numerical, non-separator values
dates <- str_replace_all(unlist(dates),"[^0-9/.]","")
# convert dates to canonical format
dates <- as.Date(dates,format="%d.%m.%Y")
# put values and dates into a list of ordered observations, converting the values from characters to numbers first.
MyZoo <- zoo(as.numeric(values),dates)
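For comparison, the same extraction idea (pull every value/name pair out of the one long line with a regular expression, then convert the types) can be sketched in Python. Everything here is an assumption for illustration: the sample line is a shortened imitation of the snippet in the question, not the real page source, and the pattern escapes the dots so that they match a literal "." only.

```python
import re
from datetime import datetime

# hypothetical one-line sample modelled on the snippet in the question
line = (
    "chart_Line1.setDataXML(\"<graph>"
    "<set value='699.99' name='16.02.2013' />"
    "<set value='731.57' name='18.02.2013' />"
    "<trendLines></trendLines></graph>\")"
)

# capture the price and the DD.MM.YYYY date of every <set .../> element
pattern = re.compile(
    r"<set value='([0-9]+\.[0-9]{2})' name='([0-9]{2}\.[0-9]{2}\.[0-9]{4})'"
)

# build (date, value) observations with proper types
observations = [
    (datetime.strptime(date, "%d.%m.%Y").date(), float(value))
    for value, date in pattern.findall(line)
]
print(observations)
```

Matching value and name together in one pattern keeps each price paired with its own date, instead of relying on two separately extracted vectors having the same length.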

Parsing csv file with vim

I have a large CSV file structured as follows:
CHINESE TRANSLATION
我去上学。 Wǒ qù shàngxué. I am going to school. 上 ♦ on, on top of ♦ go to
我去过北京。 Wǒ qùguò Běijīng. I've been to Beijing. 京 -- ♦ national capital ♦ Beijing
....
The TRANSLATION column blends together three different pieces of information: the pinyin, the English translation and additional information. These three types of information are always present, always presented in the same order, and separated by a dot.
What I want to achieve is to create three different columns from the TRANSLATION column, i.e. to get:
CHINESE PINYIN TRANSLATION ADDITIONAL
我去上学。 Wǒ qù shàngxué. I am going to school. 上 ♦ on, on top of ♦ go to
....
Using a vim macro, how can I do this?
I think vim macros can handle this job, but executing a vim macro on a big file several thousand times is very slow. So if you just want your job done, I have just written a Python script that I think could give you what you want.
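import csv
# change 'in.csv' and 'out.csv'
# to your exact file names.
with open('in.csv', 'r', encoding='utf-8', newline='') as infile:
    with open('out.csv', 'w', encoding='utf-8', newline='') as outfile:
        csvreader = csv.reader(infile)
        csvwriter = csv.writer(outfile)
        for a, b in csvreader:
            # split the second column on the first two dots only:
            # pinyin, English translation, additional information
            parts = [part.strip() for part in b.split('.', 2)]
            # the writer adds the line ending and quotes embedded commas
            csvwriter.writerow([a] + parts)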