How to format TSV files to use with torchtext?

The way I'm formatting it is like this:
Jersei N
atinge V
média N
. PU
Programe V
...
The first string in each line is the lexical item, the other is a POS tag. But the empty line (which I'm using to indicate the end of a sentence) gives me the error AttributeError: 'Example' object has no attribute 'text' when running the given code:
src = data.Field()
trg = data.Field(sequential=False)

mt_train = datasets.TabularDataset(
    path='/path/to/file.tsv',
    fields=(src, trg))
src.build_vocab(mt_train)
What is the proper way to indicate EOS to torchtext?

The following code reads the TSV the way I formatted it:
mt_train = datasets.SequenceTaggingDataset(
    path='/path/to/file.tsv',
    fields=(('text', text),
            ('labels', labels)))
It happens that SequenceTaggingDataset properly identifies an empty line as the sentence separator.
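For completeness, here is a minimal end-to-end sketch of how such a dataset is typically consumed afterwards. It assumes the legacy torchtext API used in the question (in newer releases these classes live under torchtext.legacy); the batch size is illustrative:

from torchtext import data, datasets

text = data.Field()
labels = data.Field()

mt_train = datasets.SequenceTaggingDataset(
    path='/path/to/file.tsv',
    fields=(('text', text),
            ('labels', labels)))

# Build one vocabulary per field, then batch with a standard iterator.
text.build_vocab(mt_train)
labels.build_vocab(mt_train)
train_iter = data.BucketIterator(mt_train, batch_size=32)

for batch in train_iter:
    print(batch.text)    # token indices, one sentence per column
    print(batch.labels)  # the corresponding POS-tag indices
    break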

Related

GAMS csv read issue

I'm trying to read a .csv file with the following format on a Mac:
;lon;lat
0;55,245594;25,066697
1;55,135613;25,070419
2;55,275683;25,203425
What I am doing so far is:
$call csv2gdx coords.csv id=d index=1 values=2..lastCol useHeader=y
sets
    i
    c /x,y/
;
parameters
    dloc(i,c) 'locations'
;
$gdxin clients_csv.gdx
$load ___ ?
What I want to do is read the lat/lon coordinates into the parameter dloc, so that each i has a pair of coordinates c, i.e. lat and lon.
Example output:
          x       y
i1   17.175  84.327
Running your code produces an error from csv2gdx:
*** ErrNr = 15 Msg = Values(s) column number exceeds column count; Index = 2, ColCnt = 1
By default, csv2gdx expects the entries to be separated by commas, which you do not have in your data. You could define semicolon or tab as the separator by means of an option, but if the data really has the format you posted, you do not need to call csv2gdx at all. You could just include the data directly, like this:
Sets
    i
    c
;
Table dloc(i<,c<) 'locations'
$include coords.csv
;
Display dloc;
EDIT after change of input data format:
The error message is still the same, and the reason is the same as well: you use a different field separator than the default one. If you switch that using the option fieldSep=semiColon, you will realize that your decimal separator is also non-default for csv2gdx. But this can be changed as well. Here is the whole code (with an adjusted csv2gdx call and adjustments for data loading). Note that sets i and c get implicitly defined when loading dloc with the < syntax in the declaration of dloc.
$call csv2gdx coords.csv id=d index=1 values=2..lastCol useHeader=y fieldSep=semiColon decimalSep=comma
Sets
    i
    c
;
parameters
    dloc(i<,c<) 'locations'
;
$gdxin coords.gdx
$load dloc=d
Display dloc;
$exit

How to iterate over xlsx data in Octave with mixed types

I am trying to read a simple xlsx file with xlsread in Octave. Its CSV version is shown below:
2,4,abc,6
8,10,pqr,12
14,16,xyz,18
I am trying to read and write the contents with this code:
[~, ~, RAW] = xlsread('file.xlsx');
allData = cell2mat(RAW);   # error with cell2mat()
printf('data nrows=%d, ncolms=%d\n', rows(allData), columns(allData));
for i = 1:rows(allData)
  for j = 1:columns(allData)
    printf('data(%d,%d) = %d\n', i, j, allData(i,j));
  endfor
endfor
and I am getting the following error:
error: cell2mat: wrong type elements or mixed cells, structs, and matrices
I have experimented with several variations of this problem:
(A) If I delete the column with the text data, i.e. the xlsx file contains only numbers, then this code works fine.
(B) On the other hand, if I delete the cell2mat() call, even for the purely numeric xlsx, I get an error during the cell access:
error: printf: wrong type argument 'cell'
(C) If I use cell2mat() during printf, like this:
printf('data(%d,%d) = %d\n', i,j, cell2mat(allData(i,j)));
I get correct data for the integers, and garbage for the text items.
So, how can I access and print each cell of the xlsx data, when the xlsx contains mixed-type data?
In other words, given a column index, and given that I know what type of data to expect there (integer or string), how can I re-format the cell type before using it?
A numeric array cannot hold multi-class data, hence cell2mat fails. Cell arrays are used to hold this kind of data, and you already have it in a cell array, so there is no need for conversion; just skip that line (allData = cell2mat(RAW);).
Within the loop, you have this line:
printf('data(%d,%d) = %d\n', i, j, allData(i,j) );
%                     ↑                   ↑   ↑
%                     1                   2a  2b
The problems are represented by up-arrows.
You've mixed data in your cell array, but you're using %d as the data specifier. You can fix this by converting all of your data to strings and then using %s as the specifier.
If you use parentheses ( ) to index a cell array, you will get a cell. What you need here is the content of that cell, and braces { } are used for that.
So it will be:
printf('data(%d,%d) = %s\n', i,j, num2str(RAW{i,j}));
Note that instead of all that, you can simply enter RAW to get this:
octave:1> RAW
RAW =
{
[1,1] = 2
[2,1] = 8
[3,1] = 14
[1,2] = 4
[2,2] = 10
[3,2] = 16
[1,3] = abc
[2,3] = pqr
[3,3] = xyz
[1,4] = 6
[2,4] = 12
[3,4] = 18
}

EOF Error During Dict Slice

I am trying to compile monthly data into an existing JSON file that I loaded via import json. Initially, my JSON data just had one property, 'name':
json_data['features'][1]['properties']
>> {'name': 'John'}
But the end result with the monthly data I want is like this:
json_data['features'][1]['properties']
>> {'name': 'John',
    '2016-01': {'x1': 0, 'x2': 0, 'x3': 1, 'x4': 0},
    '2016-02': {'x1': 1, 'x2': 0, 'x3': 1, 'x4': 0}, ... }
My monthly data are on separate tsv files. They have this format:
John 0 0 1 0
Jane 1 1 1 0
so I loaded them via import csv, iterated through a list of file names, and set about placing them in a collective dictionary like so:
file_strings = ['2016-01.tsv', '2016-02.tsv', ... ]
collective_dict = {}
for i in file_strings:
    with open(i) as f:
        tsv_object = csv.reader(f, delimiter='\t')
        collective_dict[i[:-4]] = rows[0]:rows[1:5] for rows in tsv_object
I checked how things turned out by slicing collective_dict like so:
collective_dict['2016-01']['John'][0]
>>'0'
Which is correct; it just needs to be cast into an integer.
For my next feat, I attempted to assign all of the monthly data to the respective json members as part of their external properties:
for i in file_strings:
    for j in range(len(json_data['features'])):
        json_data['features'][j]['properties'][i[:-4]] = {}
        json_data['features'][j]['properties'][i[:-4]]['x1'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][0])
        json_data['features'][j]['properties'][i[:-4]]['x2'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][1])
        json_data['features'][j]['properties'][i[:-4]]['x3'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][2])
        json_data['features'][j]['properties'][i[:-4]]['x4'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][3])
Here I got an arrow pointing at the last few characters:
SyntaxError: unexpected EOF while parsing
It is a pretty complicated slice, so user error is not to be ruled out. However, I double- and triple-checked things. I also looked up this error; it seems to come up with input()-related calls. I'm left a bit confused: I don't see how I made a mistake (although I'm already mentally prepared to accept that I did).
My only guess was that something somewhere was not a string. When I checked collective_dict and json_data, everything that was supposed to be a string was a string ('John', 'Jane' et al.). So I guess it's something else.
I made the problem as simple as I could while keeping the original structure of the data and for loops and so forth. I'm using Python 3.6.
Question
Why am I getting the EOF error? How can I build my external properties data without encountering such an error?
First, I have rewritten your last code block to:
for i in file_strings:
    file_name = i[:-4]
    for j in range(len(json_data['features'])):
        name = json_data['features'][j]['properties']['name']
        file_dict = json_data['features'][j]['properties'][file_name] = {}
        for x in range(4):
            x_string = 'x{}'.format(x + 1)
            file_dict[x_string] = int(collective_dict[file_name][name][x])
from:
for i in file_strings:
    for j in range(len(json_data['features'])):
        json_data['features'][j]['properties'][i[:-4]] = {}
        json_data['features'][j]['properties'][i[:-4]]['x1'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][0])
        json_data['features'][j]['properties'][i[:-4]]['x2'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][1])
        json_data['features'][j]['properties'][i[:-4]]['x3'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][2])
        json_data['features'][j]['properties'][i[:-4]]['x4'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][3])
That is just to make it a bit more readable, but that shouldn't change anything.
One thing I noticed in another part of your code is the following:
collective_dict[i[:-4]] = rows[0]:rows[1:5] for rows in tsv_object
The part I refer to is = rows[0]:rows[1:5] for rows in tsv_object. In my IDE, that does not work, and I'm not sure if that is a typo in your question or if it is actually in your code, but I imagine you want it to be
collective_dict[i[:-4]] = {rows[0]:rows[1:5] for rows in tsv_object}
or something like that. I'm not sure if that could confuse the parser into thinking that there is an error at the end of the file.
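For reference, here is a runnable sketch of that loading loop with the braces in place (the file list is shortened here; extend it as in your question):

import csv

file_strings = ['2016-01.tsv', '2016-02.tsv']
collective_dict = {}
for i in file_strings:
    with open(i) as f:
        tsv_object = csv.reader(f, delimiter='\t')
        # Dict comprehension: map each name to its four values.
        collective_dict[i[:-4]] = {rows[0]: rows[1:5] for rows in tsv_object}

print(collective_dict['2016-01']['John'][0])  # -> '0' (still a string)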
The ValueError: Invalid literal for int()
If your tsv-data is
John 0 0 1 0
Jane 1 1 1 0
Then it should be no problem to do int() of the string value. E.g. int('42') will become an int with value 42. However, if you have an error in one or several lines of your files, then use something like this block of code to figure out which file and line it is:
file_strings = ['2016-01.tsv', '2016-02.tsv', ... ]
collective_dict = {}
for file_name in file_strings:
    print('Reading {}'.format(file_name))
    with open(file_name) as f:
        tsv_object = csv.reader(f, delimiter='\t')
        for line_no, (name, *x_values) in enumerate(tsv_object):
            if len(x_values) != 4:
                print('On line {}, there are only {} values!'.format(line_no, len(x_values)))
            try:
                intx = [int(x) for x in x_values]
            except ValueError as e:
                # Catch "Invalid literal for int()"
                print('Line {}: {}'.format(line_no, e))

Read Dataset CSV with Line Feeds in Cells

We are using the following code to read a CSV file from the Application Server:
OPEN DATASET file_str FOR INPUT IN TEXT MODE ENCODING DEFAULT.
*--------------------------------------------------*
* process and display output
*--------------------------------------------------*
DO.
  CLEAR: lv_record, idat.
  READ DATASET file_str INTO lv_record.
  IF sy-subrc NE 0.
    EXIT.
  ELSE.
The problem we encounter now is that the CSV file holds line feeds inside cells. If we read it with the above code, READ DATASET splits the record in the middle of a cell instead of at its end.
What is the best way of handling this? We tried to read the file with the line feeds and do a "replace all", but we can't seem to see the line feeds in the READ DATASET result.
Thanks for your help!
This is a standard string-handling issue; nothing specific to ABAP, you would encounter the same problem with BufferedReader.readLine(). Just check whether the line is complete: it either contains the correct number of fields, or an even number of un-escaped cell delimiters (i.e. "). If it doesn't, read the next line, append it with CL_ABAP_CHAR_UTILITIES=>CR_LF, and repeat.
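To illustrate the technique outside ABAP, here is a minimal sketch in Python: keep appending physical lines to a buffer until the double quotes balance out, and only then treat the buffer as one logical record (the file name and the quoting convention are assumptions):

def read_logical_records(path):
    """Yield logical CSV records, re-joining physical lines that were
    split by line feeds inside double-quoted cells."""
    buffer = ''
    with open(path, encoding='utf-8') as f:
        for physical_line in f:
            buffer += physical_line
            # An odd number of '"' characters means we are still inside
            # a quoted cell, so the record is not complete yet.
            if buffer.count('"') % 2 == 0:
                yield buffer.rstrip('\n')
                buffer = ''
    if buffer:  # trailing incomplete record, if any
        yield buffer.rstrip('\n')

for record in read_logical_records('input.csv'):
    print(record)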
This is the solution:
OPEN DATASET file_str FOR INPUT IN TEXT MODE ENCODING DEFAULT.
*--------------------------------------------------*
* process and display output
*--------------------------------------------------*
DATA: len TYPE i.
DATA: test TYPE string.
DATA: lv_new       TYPE i,
      lv_last_char TYPE c.
DATA: lv_concat TYPE string.
DO.
  CLEAR: lv_record, idat, lv_concat.
  READ DATASET file_str INTO lv_record.
  IF sy-subrc NE 0.
    EXIT.
  ELSE.
    "-- Get the string length
    CALL FUNCTION 'STRING_LENGTH'
      EXPORTING
        string = lv_record
      IMPORTING
        length = lv_new.
    "-- Check if the string is ended correctly
    lv_new = lv_new - 1.
    lv_last_char = lv_record+lv_new(1).
    IF lv_last_char EQ '"'.
      CONTINUE.
    ELSE.
      "-- Read next line
      CONCATENATE lv_concat lv_record INTO lv_concat.
      CLEAR lv_record.
      WHILE lv_last_char NE '"'.
        READ DATASET file_str INTO lv_record.
        CALL FUNCTION 'STRING_LENGTH'
          EXPORTING
            string = lv_record
          IMPORTING
            length = lv_new.
        lv_new = lv_new - 1.
        lv_last_char = lv_record+lv_new(1).
        CONCATENATE lv_concat lv_record INTO lv_concat.
      ENDWHILE.
    ENDIF.
    IF lv_concat IS NOT INITIAL.
      CLEAR lv_record.
      MOVE lv_concat TO lv_record.
    ENDIF.

How to save a list in a CSV file in Python?

I want to transpose rows into columns and then save the words in a CSV file. The problem is that only the last value of each column is saved in the file after the transpose, and if I append the strings to a list, they are saved in the file, but as characters, not words.
Can anyone help me sort this out? Thanks in advance.
import re
import csv

app = []
with open('afterstem.csv') as f:
    words = [x.split() for x in f]

for x in zip(*words):
    for y in x:
        res = y
        newstr = re.sub('"', r'', res)
        app = app + list(res)
        #print("AFTER", newstr)

with open(r"removequotes.csv", "w") as output:
    writer = csv.writer(output, lineterminator='\n', delimiter='\t')
    for val in app:
        writer.writerow(val)
output.close()
The output saved in the file puts each character in its own cell. But I want "Bank" in one cell.
Simply use
for column in zip(*words):
    newrows = [[word.replace('"', '')] for word in column]
    app.extend(newrows)
to put all columns one after another into the first column.
newrows = [[word.replace('"', '')] for word in column] creates a new list for each column, with the double quotes stripped and each word wrapped in its own list, and app.extend(newrows) appends all of these lists to your result variable app.
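As a small illustration of what one iteration produces (the column contents here are made up):

column = ('"Bank"', '"loan"')
newrows = [[word.replace('"', '')] for word in column]
print(newrows)  # [['Bank'], ['loan']]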
You got your result because of your inner loop and in particular its last line:
for y in x:
    ...
    app = app + list(res)
The for-loop takes each word in each column and list(res) converts the string with the word into a list of characters. So "Bank" becomes ['B', 'a', 'n', 'k'], etc. Then app = app + list(res) creates a new list that contains every item from app and the characters from the word and assigns that to app.
In the end you got an array containing every letter from the file instead of an array with all the words in the file in the right order. The call to writer.writerow(val) then wrote each letter as its own row.
BTW: If your input also uses tabs to delimit columns, it might be easier to use list(csv.reader(f, lineterminator='\n', delimiter='\t')) instead of your simple read with split() and stripping of quotes.
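Putting both suggestions together, a compact version of the whole script might look like this (keeping your file names; the tab-delimited input is an assumption):

import csv

# Read the tab-delimited input into a list of rows.
with open('afterstem.csv', newline='') as f:
    words = list(csv.reader(f, delimiter='\t'))

# Transpose, strip double quotes, and stack all columns into one column.
app = []
for column in zip(*words):
    app.extend([[word.replace('"', '')] for word in column])

# Each inner list is one row with a single cell, so words stay intact.
with open('removequotes.csv', 'w', newline='') as output:
    writer = csv.writer(output, lineterminator='\n', delimiter='\t')
    writer.writerows(app)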