I have a data file (data.mat) in this format:
load('data.mat')
>> disp(X)
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 0
1 1 1 0
1 1 1 0
1 1 0 0
1 1 0 0
1 1 0 0
>> disp(y)
1
1
1
2
2
2
3
3
3
I'm attempting to create text data file(s) that Octave can convert to .mat.
xvalues.txt:
1, 1, 1, 1
1, 1, 1, 1
1, 1, 1, 1
1, 1, 1, 0
1, 1, 1, 0
1, 1, 1, 0
1, 1, 0, 0
1, 1, 0, 0
1, 1, 0, 0
yvalues.txt:
1
1
1
2
2
2
3
3
3
To instruct Octave to read xvalues.txt into variable X and yvalues.txt into variable y, I use:
X = xvalues.txt
y = yvalues.txt
Is this the idiomatic method of reading files in Octave? These data files will contain 10^6 rows of data at some point; is there a more performant method of loading them?
In the following code, each column of the xvalues file is read as a separate vector, then combined into a matrix X:
[a,b,c,d] = textread("xvalues.txt", "%d %d %d %d", 'delimiter', ',', "endofline", "\n");
X = [a, b, c, d];
[y] = textread("yvalues.txt", "%d", "endofline", "\n");
disp(X);
disp(y);
See the Octave reference documentation for textread.
No need to convert the text files.
You can use the function dlmread():
data = dlmread(file, sep, r0, c0)
Read the matrix data from a text file which uses the delimiter sep between data values.
If sep is not defined the separator between fields is determined from the file itself.
Given two scalar arguments r0 and c0, these define the starting row and column of the data to be read. These values are indexed from zero, such that the first row corresponds to an index of zero.
So simply writing
X = dlmread('xvalues.txt');
y = dlmread('yvalues.txt');
does the job.
Note that Octave can infer the separator (',' in your case) from the file.
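Purely for comparison outside Octave, the same one-call load can be sketched in Python/NumPy (in-memory stand-ins replace the two files here; note that, unlike dlmread, np.loadtxt does not infer the separator, so the comma is passed explicitly):

```python
import numpy as np
from io import StringIO

# In-memory stand-ins for xvalues.txt and yvalues.txt.
xvalues = StringIO("1, 1, 1, 1\n1, 1, 1, 0\n1, 1, 0, 0\n")
yvalues = StringIO("1\n2\n3\n")

X = np.loadtxt(xvalues, delimiter=',')  # 3x4 matrix
y = np.loadtxt(yvalues)                 # length-3 vector
```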
Let's say I've got a matrix with n columns, and I've got n different functions.
Is it possible to apply i-th function per each element in i-th column efficiently, that is without using loop?
For example for the following variables:
funs = @(x) [x, cos(x), x.^2]
A = [1 0 1
2 0 2
3 0 3
4 0 4] ;
I would like to obtain the following result:
B = [1 1 1
2 1 4
3 1 9
4 1 16] ;
without looping through columns...
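The question is left open here, but the intended column-wise mapping can be sketched outside Octave in Python/NumPy (assumptions: funs is taken as a list with one function per column; the comprehension still iterates over the n columns, just not over individual elements):

```python
import numpy as np

# One function per column (hypothetical interpretation of funs).
funs = [lambda x: x, np.cos, lambda x: x ** 2]

A = np.array([[1, 0, 1],
              [2, 0, 2],
              [3, 0, 3],
              [4, 0, 4]], dtype=float)

# Apply the i-th function to the i-th column and restack the results.
B = np.column_stack([f(A[:, i]) for i, f in enumerate(funs)])
```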
I'm trying to apply CountVectorizer to a dataframe column containing bigrams, to convert it into a frequency matrix showing the number of times each bigram appears in each row, but I keep getting error messages.
This is what I tried:
cereal['bigrams'].head()
0 [(best, thing), (thing, I), (I, have),....
1 [(eat, it), (it, every), (every, morning),...
2 [(every, morning), (morning, my), (my, brother),...
3 [(I, have), (five, cartons), (cartons, lying),...
.........
bow = CountVectorizer(max_features=5000, ngram_range=(2,2))
train_bow = bow.fit_transform(cereal['bigrams'])
train_bow
Expected results
(best,thing) (thing, I) (I, have) (eat,it) (every,morning)....
0 1 1 1 0 0
1 0 0 0 1 1
2 0 0 0 0 1
3 0 0 1 0 0
....
I see you are trying to convert a pd.Series into a count representation of each term.
That's a bit different from what CountVectorizer does.
From the function description:
Convert a collection of text documents to a matrix of token counts
The official usage example is:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
So, as one can see, it takes as input a list where each item is a "document".
That's probably the cause of the errors you are getting: you are passing a pd.Series in which each element is a list of tuples.
To use CountVectorizer you would have to transform your input into the proper format.
If you have the original corpus/text, you can simply apply CountVectorizer to it (with the ngram_range parameter) to get the desired result.
Otherwise, the best solution would be to treat the data as what it is: a series holding lists of items, which must be counted/pivoted.
Sample workaround:
(it would be a lot easier if you just used the text corpus instead)
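A minimal sketch of that count/pivot treatment (hypothetical names; it assumes the bigrams column holds lists of 2-tuples, as in the question):

```python
import pandas as pd

# Hypothetical stand-in for cereal['bigrams']: lists of 2-tuples per row.
bigrams = pd.Series([
    [('best', 'thing'), ('thing', 'I'), ('I', 'have')],
    [('eat', 'it'), ('it', 'every'), ('every', 'morning')],
])

# One row per (original row, bigram) pair; tuples joined into plain strings.
exploded = bigrams.explode().map(' '.join)

# Pivot into a row-by-bigram frequency matrix.
counts = pd.crosstab(exploded.index, exploded)
```

With the original text corpus available, CountVectorizer(ngram_range=(2, 2)) would produce this kind of matrix directly.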
Hope it helps!
I am not sure I am describing the problem using the correct terms; my mathematical English is not that good.
What I need to do is compare two integers digit by digit, based on the position of each digit (ones, tens, etc.).
For example check the following table of different numbers and the wanted comparison result:
number1 | number2 | desired result
-----------------------------------
100 | 101 | 001
443 | 143 | 300
7001 | 8000 | 1001
6001 | 8000 | 2001
19 | 09 | 10
Basically I need the absolute value of the subtraction for each digit alone. So for the first example:
1 0 0
1 0 1 -
--------
0 0 1
And second:
4 4 3
1 4 3 -
-------
3 0 0
And third:
7 0 0 1
8 0 0 0 -
---------
1 0 0 1
This needs to be done in mysql. Any ideas please?
This should do the job if your numbers are below 10000.
If they exceed that, simply extend the query ;)
SELECT number1,
number2,
REVERSE(CONCAT(ABS(SUBSTRING(REVERSE(number1), 1, 1) - SUBSTRING(REVERSE(number2), 1, 1)),
IF(CHAR_LENGTH(number1) > 1, ABS(SUBSTRING(REVERSE(number1), 2, 1) - SUBSTRING(REVERSE(number2), 2, 1)), ''),
IF(CHAR_LENGTH(number1) > 2, ABS(SUBSTRING(REVERSE(number1), 3, 1) - SUBSTRING(REVERSE(number2), 3, 1)), ''),
IF(CHAR_LENGTH(number1) > 3, ABS(SUBSTRING(REVERSE(number1), 4, 1) - SUBSTRING(REVERSE(number2), 4, 1)), ''))) as `desired result`
FROM numbers
For 3-digit numbers:
SELECT number1,
number2,
CONCAT(
ABS(SUBSTRING(number1, 1, 1) - SUBSTRING(number2, 1,1)),
ABS(SUBSTRING(number1, 2, 1) - SUBSTRING(number2, 2,1)),
ABS(SUBSTRING(number1, 3, 1) - SUBSTRING(number2, 3,1))
)
FROM numbers
Actually, you don't have to reverse the strings at all, as the three-digit version shows; that idea came from a more mathematical approach I tried before ;)
If you want to do it with integers only, it can be done this way (for 5 digits as an example; DIV keeps the divisions integer divisions, which plain / would not):
SELECT ABS(number1 DIV 10000 % 10 - number2 DIV 10000 % 10) * 10000 +
       ABS(number1 DIV 1000 % 10 - number2 DIV 1000 % 10) * 1000 +
       ABS(number1 DIV 100 % 10 - number2 DIV 100 % 10) * 100 +
       ABS(number1 DIV 10 % 10 - number2 DIV 10 % 10) * 10 +
       ABS(number1 % 10 - number2 % 10) AS `desired result`
FROM numbers
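The per-digit arithmetic can be sanity-checked against the table above with a small sketch (Python here, purely for illustration; note that as a number the result drops leading zeros, so 001 comes out as 1):

```python
def digit_diff(number1, number2, width=5):
    """Absolute difference per decimal digit, mirroring the integer-only SQL.

    Sketch only: assumes non-negative numbers of at most `width` digits.
    """
    result = 0
    for p in range(width):
        d1 = number1 // 10 ** p % 10  # digit of number1 at position p
        d2 = number2 // 10 ** p % 10  # digit of number2 at position p
        result += abs(d1 - d2) * 10 ** p
    return result
```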
I have several text files that I need to import into MySQL, but they don't have any delimiters, and 3 lines in the text file represent one record.
When I try to import it everything goes into one column. Please see an example below
00003461020000001ACH1 00000000 00000000000 00000000 000000005011025708084 0 00 00 000000000000000000000 00000000241523551MA00
You need a helper table first.
CREATE TABLE tmpHelperTable(
your_data varchar(255),
a int,
b int
);
Then you need two user-defined variables while loading your data:
SET @va = 0;
SET @vb = 0;
LOAD DATA INFILE 'your_data_file.csv'
INTO TABLE tmpHelperTable
LINES TERMINATED BY '\n'
(your_data)
SET a = @va := IF(@va = 3, 1, @va + 1),
    b = IF(@va = 1, @vb := @vb + 1, @vb);
This line
SET a = @va := IF(@va = 3, 1, @va + 1),
is just a counter that resets when it reaches 3 (or however many lines make up one record).
The line
b = IF(@va = 1, @vb := @vb + 1, @vb);
just increments its value every time the previous counter resets, i.e. at the first line of each record. We need this so we can group by it. Then you have a table like this:
your_data | a | b
xxxxxx 1 1
yyyyyy 2 1
zzzzzz 3 1
aaaaaa 1 2
bbbbbb 2 2
cccccc 3 2
dddddd 1 3
...
Then all you have to do is to pivot the table into your final table.
CREATE TABLE final_table(
id int,
data_1 varchar(255),
data_2 varchar(255),
data_3 varchar(255)
);
INSERT INTO final_table
SELECT
b,
MAX(IF(a = 1, your_data, NULL)),
MAX(IF(a = 2, your_data, NULL)),
MAX(IF(a = 3, your_data, NULL))
FROM
tmpHelperTable
GROUP BY b;
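The regrouping the SQL performs — every three consecutive lines collapsed into one record — can be cross-checked with a short sketch outside the database (Python, with placeholder data):

```python
# Placeholder lines standing in for the delimiter-less input file,
# where every three consecutive lines form one record.
lines = ['xxxxxx', 'yyyyyy', 'zzzzzz', 'aaaaaa', 'bbbbbb', 'cccccc']

# Chunk the flat list into fixed-size records, like the helper-table pivot.
records = [tuple(lines[i:i + 3]) for i in range(0, len(lines), 3)]
```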
I have a data frame that contains 508383 rows. I am only showing the first 10 rows.
0 1 2
0 chr3R 4174822 4174922
1 chr3R 4175400 4175500
2 chr3R 4175466 4175566
3 chr3R 4175521 4175621
4 chr3R 4175603 4175703
5 chr3R 4175619 4175719
6 chr3R 4175692 4175792
7 chr3R 4175889 4175989
8 chr3R 4175966 4176066
9 chr3R 4176044 4176144
I want to iterate through the rows and compare the value in column #2 of the first row to the value in the next row, checking whether the difference between these values is less than 5000. If the difference is greater than 5000, I want to slice the data frame from the first row to the previous row and make that a subset data frame.
I then want to repeat this process to create a second subset data frame, and so on. I've only managed to get this done by using the csv reader in combination with Pandas.
Here is my code:
#!/usr/bin/env python
import pandas as pd
data = pd.read_csv('sort_cov_emb_sg.bed', sep='\t', header=None, index_col=None)
import csv
file = open('sort_cov_emb_sg.bed')
readCSV = csv.reader(file, delimiter="\t")
first_row = readCSV.next()
print first_row
count_1 = 0
while count_1 < 100000:
next_row = readCSV.next()
value_1 = int(next_row[1]) - int(first_row[1])
count_1 = count_1 + 1
if value_1 < 5000:
continue
else:
break
print next_row
print count_1
print value_1
window_1 = data[0:63]
print window_1
first_row = readCSV.next()
print first_row
count_2 = 0
while count_2 < 100000:
next_row = readCSV.next()
value_2 = int(next_row[1]) - int(first_row[1])
count_2 = count_2 + 1
if value_2 < 5000:
continue
else:
break
print next_row
print count_2
print value_2
window_2 = data[0:74]
print window_2
I wanted to know if there is a better way to do this process (without repeating the code every time) and get all the subset data frames I need.
Thanks.
Rodrigo
This is yet another example of the compare-cumsum-groupby pattern. Using only the rows you showed (and so changing the threshold to 100 instead of 5000):
jumps = df[2] > df[2].shift() + 100
grouped = df.groupby(jumps.cumsum())
for k, group in grouped:
print(k)
print(group)
produces
0
0 1 2
0 chr3R 4174822 4174922
1
0 1 2
1 chr3R 4175400 4175500
2 chr3R 4175466 4175566
3 chr3R 4175521 4175621
4 chr3R 4175603 4175703
5 chr3R 4175619 4175719
6 chr3R 4175692 4175792
2
0 1 2
7 chr3R 4175889 4175989
8 chr3R 4175966 4176066
9 chr3R 4176044 4176144
This works because the comparison gives us a new True every time a new group starts, and when we take the cumulative sum of that, we get what is effectively a group id, which we can group on:
>>> jumps
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: 2, dtype: bool
>>> jumps.cumsum()
0 0
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
Name: 2, dtype: int32
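To get the separate subset data frames the question asks for, the groups can simply be collected into a list; a sketch using the sample rows and the 100 threshold:

```python
import pandas as pd

# The ten sample rows from the question (columns 0, 1, 2).
df = pd.DataFrame({
    0: ['chr3R'] * 10,
    1: [4174822, 4175400, 4175466, 4175521, 4175603,
        4175619, 4175692, 4175889, 4175966, 4176044],
    2: [4174922, 4175500, 4175566, 4175621, 4175703,
        4175719, 4175792, 4175989, 4176066, 4176144],
})

# Compare-cumsum-groupby, then collect each group as its own DataFrame.
jumps = df[2] > df[2].shift() + 100
windows = [group for _, group in df.groupby(jumps.cumsum())]
```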