Big CSV parsing [closed] - csv

I'm having quite a hard time figuring out a robust and light algorithm for post-processing some big CSV files. Here's a minimal example of what they look like:
Time a b c
0 2.9 1.6 4.1
0 3.6 1.1 0.5
0 3.4 0.2 1.7
1.2 0.1 4.2 1.9
1.201 2.3 3.1 4.8
9.99 0.2 0.8 1.2
10 3.1 3.3 2.3
10 3.6 3.5 3.0
10.01 1.1 4.5 3.9
10.01 2.2 3.0 2.3
17 4.3 2.3 3.8
20 1.0 3.2 3.0
30 4.1 3.0 4.9
40 3.8 3.3 1.6
I need to post-process my CSV based on these rules:
only lines whose time is a multiple of 10 need to be considered
if multiple lines have the same timestamp, take the average value of each column across those rows
Here's the output I'd like to get:
Time a b c
0 3.3 0.97 2.1
10 2.04 3.02 2.54
20 1.0 3.2 3.0
30 4.1 3.0 4.9
40 3.8 3.3 1.6
Now the constraint: my script needs to handle pretty big CSV files (up to a few hundred MB) on a Windows machine without much memory available for this. Because of that, I'm not keen on storing the whole CSV in a big array or dictionary; I'd prefer to process it row by row.
Here's my first naive attempt. It's very rough and not working properly. (Small margin note: the average is not a true average but a kind of weird "running average". Bear with me here; I was trying to assess the workflow and don't really care about the numbers at this stage.)
filename = "test"
sampling_time = 10.0
tolerance = 1e-1
Dim FSO, input, output
Const ForReading = 1
Const ForWriting = 2
'Create the objects
Set FSO = CreateObject("Scripting.FileSystemObject")
Set input = FSO.OpenTextFile(filename & ".csv", ForReading, False)
Set output = FSO.OpenTextFile(filename & "_output.csv", ForWriting, True)
'First line: write headers
s = input.ReadLine()
output.WriteLine s
'Second line: initialize sSplit_old
s = input.ReadLine()
sSplit = Split(s, ",")
sSplit_old = sSplit
'Keep reading...
Do Until input.AtEndOfStream
'read new line and split it into its components
'this is needed to read the first element of the line, i.e. the time
s = input.ReadLine()
sSplit = Split(s, ",")
'If the remainder of time/sampling_time is below the tolerance then the
'line has to be processed.
'Here the "\" operator (i.e. the integer division: 5\2=2, while 5/2=2.5)
'is used as the "Mod" operator return integer remainders.
If CDbl(sSplit(0))-sampling_time*(CDbl(sSplit(0))\sampling_time) < tolerance Then
'If the current time is close to the previous one (within a tolerance)...
If Abs(CDbl(sSplit(0))-CDbl(sSplit_old(0))) < tolerance Then
'... cycle through the arrays and store the average
For i = 0 To UBound(sSplit)
sSplit_old(i) = (CDbl(sSplit(i)) + CDbl(sSplit_old(i))) / 2.0
Next
Else
'... otherwise just write the previous time and save the current
'one to compare it to the next one
s = Join(sSplit_old, ",")
output.WriteLine s
sSplit_old = sSplit
End If
End If
Loop
output.WriteLine s
input.Close
output.Close
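For comparison, here is a minimal streaming sketch in Python of what I'm after (an added illustration, assuming the same file names and the 0.1 tolerance used in the script above; it keeps only one timestamp group in memory and computes a true average per group):

import csv

SAMPLING = 10.0   # keep rows near multiples of this
TOL = 0.1         # assumption: same tolerance as the VBScript attempt

with open("test.csv", newline="") as fin, \
     open("test_output.csv", "w", newline="") as fout:
    reader, writer = csv.reader(fin), csv.writer(fout)
    writer.writerow(next(reader))            # copy the header line

    group_key, sums, count = None, [], 0
    for row in reader:
        t = float(row[0])
        r = t % SAMPLING
        if r > TOL and SAMPLING - r > TOL:   # not near a multiple of 10
            continue
        key = round(t / SAMPLING) * SAMPLING
        if key != group_key:                 # a new timestamp group starts
            if count:                        # flush the previous group
                writer.writerow([int(group_key)] +
                                [round(s / count, 2) for s in sums])
            group_key, sums, count = key, [0.0] * (len(row) - 1), 0
        sums = [s + float(v) for s, v in zip(sums, row[1:])]
        count += 1
    if count:                                # flush the last group
        writer.writerow([int(group_key)] +
                        [round(s / count, 2) for s in sums])

Rows within the tolerance of the same multiple of 10 (e.g. 9.99, 10 and 10.01) fall into one group, which reproduces the expected output above, and memory use stays constant regardless of file size.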

When you paid (too much) for your Windows OS, you also paid for a SQL engine. So use it:
Option Explicit

Dim db : Set db = CreateObject("ADODB.Connection")
Dim dd : dd = "E:\work\proj\soa\47155733\data"
Dim cs
If "AMD64" = CreateObject("WScript.Shell").ExpandEnvironmentStrings("%PROCESSOR_ARCHITECTURE%") Then
    cs = "Driver=Microsoft Access Text Driver (*.txt, *.csv);Dbq=" & dd & ";Extensions=asc,csv,tab,txt;"
    WScript.Echo "64 Bit:", cs
Else
    cs = "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=" & dd & ";Extensions=asc,csv,tab,txt;"
    WScript.Echo "32 Bit:", cs
End If
db.Open cs

Dim ss : ss = "SELECT * FROM [47155733.txt]"
WScript.Echo ss
WScript.Echo db.Execute(ss).GetString(2, , vbTab, vbCrLf, "*")

ss = "SELECT t, avg(a), avg(b), avg(c) FROM [47155733.txt]" _
   & " WHERE t = Int(t) And 0.0 = t Mod 10 GROUP BY t"
WScript.Echo ss
WScript.Echo db.Execute(ss).GetString(2, , vbTab, vbCrLf, "*")

ss = "SELECT Round(1/3, 3)"
WScript.Echo ss
WScript.Echo db.Execute(ss).GetString(2, , vbTab, vbCrLf, "*")
output:
cscript 47155733.vbs
SELECT * FROM [47155733.txt]
0 2,9 1,6 4,1
0 3,6 1,1 0,5
0 3,4 0,2 1,7
1,2 0,1 4,2 1,9
1,201 2,3 3,1 4,8
9,99 0,2 0,8 1,2
10 3,1 3,3 2,3
10 3,6 3,5 3
10,01 1,1 4,5 3,9
10,01 2,2 3 2,3
17 4,3 2,3 3,8
20 1 3,2 3
30 4,1 3 4,9
40 3,8 3,3 1,6
SELECT t, avg(a), avg(b), avg(c) FROM [47155733.txt] WHERE t = Int(t) And 0.0 = t Mod 10 GROUP BY t
0 3,3 0,966666666666667 2,1
10 3,35 3,4 2,65
20 1 3,2 3
30 4,1 3 4,9
40 3,8 3,3 1,6
SELECT Round(1/3, 3)
0,333
Tested for 32 and 64 bit on Windows 10; German locale. I prefer to specify the file format in a schema.ini file:
[47155733.txt]
Format=Delimited(,)
ColNameHeader=True
DecimalSymbol=.
Col1=t Double
Col2=a Double
Col3=b Double
Col4=c Double
Background: connection strings, ODBC connection strings, driver download.

Related

Why does an out-of-bound error happen in this for loop?

An out-of-bound error occurred. This is Octave code:
for ii = 1:1:10
    m(ii) = ii*8
    q = m(ii)
    if (ii >= 2)
        q(ii).xdot = (q(ii).x - q(ii-1).x)/Ts;
    end
end
But the error says:
q(2): out of bound 1
How can I fix it?
For this type of assignment you do not need a loop, and in any case you need to define Ts.
To calculate the difference between consecutive elements you can use diff:
x=(1:1:10)*8
x =
8 16 24 32 40 48 56 64 72 80
octave:5> Ts=2
Ts = 2
octave:6> xdot=diff(x)/Ts
xdot =
4 4 4 4 4 4 4 4 4
octave:7> size(x)
ans =
1 10
octave:8> size(xdot)
ans =
1 9
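For reference, the same computation as a Python/numpy sketch (an added illustration, not part of the original answer; assumes numpy is installed):

import numpy as np

x = np.arange(1, 11) * 8    # 8, 16, ..., 80
Ts = 2
xdot = np.diff(x) / Ts      # differences of consecutive elements, scaled by Ts
print(xdot)                 # [4. 4. 4. 4. 4. 4. 4. 4. 4.]
print(x.size, xdot.size)    # 10 9 -- diff output is one element shorter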

SSRS running total on percentage, show based on only every 10th value

In an SSRS report I have a field with percentage values from which I have calculated a cumulative running total. In a third field the user wants to see only the values which are closest to each multiple of ten, and blank out everything else.
So in the example we show 8 as the cumulative value, as it is the closest to 10. For the second value we choose 20, as it is the closest to 20. For the third we take 32, closest to 30; then 40, 52, 62, 73, 79, 91.
%   cumulative val   shown value
3 3
5 8 8
6 14
3 17
2 19
1 20 20
4 24
3 27
5 32 32
7 39
1 40 40
2 42
2 44
3 47
5 52 52
2 54
3 57
1 58
4 62 62
3 65
1 66
7 73 73
2 75
1 76
3 79 79
4 83
2 85
3 88
1 89
2 91 91
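To make the selection rule concrete before any SSRS specifics, here is a minimal Python sketch of the logic (plain Python, added for illustration; the tie-breaking in favour of the later row is an assumption read off the example, e.g. 62 over 58 and 91 over 89):

percents = [3, 5, 6, 3, 2, 1, 4, 3, 5, 7, 1, 2, 2, 3, 5,
            2, 3, 1, 4, 3, 1, 7, 2, 1, 3, 4, 2, 3, 1, 2]

# running totals
cums, total = [], 0
for p in percents:
    total += p
    cums.append(total)

# for each multiple of 10, keep the cumulative value closest to it
best = {}
for c in cums:
    target = (c + 5) // 10 * 10      # nearest multiple of 10
    if target == 0:
        continue
    if target not in best or abs(c - target) <= abs(best[target] - target):
        best[target] = c             # '<=' lets a later tie win

shown = set(best.values())
for p, c in zip(percents, cums):
    print(f"{p:2d} {c:3d} {c if c in shown else '':>4}")

This prints the three columns from the table above, with only 8, 20, 32, 40, 52, 62, 73, 79 and 91 in the last column.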
I have tried to use the solution with a different set of records, and this is the result set I can see.
Whenever I find questions like this in the reporting-services tag I think about the possibilities users would have if there were something like a NEXT() function, as a counterpart to the supported PREVIOUS() function.
Okay, I vented.
I recreated the table you posted in the question and added a row number to the dataset.
Go to the Report menu / Report Properties / Code tab, and in the text area put the following code:
Dim running_current As Integer = 0
Dim flag As Integer = 0

Public Function Visible(ByVal a As Integer, ByVal b As Integer) As String
    running_current = running_current + a
    Dim running_next As Integer = running_current + b
    Dim a_f As Integer = CInt(Math.Ceiling(running_current / 10.0)) * 10
    Dim b_f As Integer = CInt(Math.Ceiling(running_next / 10.0)) * 10
    If flag = 1 Then
        flag = 0
        Return "true"
    End If
    If a_f = b_f Then
        Return "false"
    Else
        If Closest(running_current, running_next) = "false" Then
            flag = 1
            Return "false"
        End If
        Return "true"
    End If
End Function

Public Function Closest(ByVal a As Integer, ByVal b As Integer) As String
    Dim target As Integer = CInt(Math.Ceiling(a / 10.0)) * 10
    If Math.Abs(a - target) >= Math.Abs(b - target) Then
        Return "false"
    Else
        Return "true"
    End If
End Function
This function compares every value with the next one, based on the row number, in order to determine whether it must be shown. If the value must be shown it returns the string "true", otherwise "false".
You have to pass the function the current value and the next one, based on the row number. The Lookup() function plus the row number gives us behaviour similar to a NEXT() function.
=IIf(
    Code.Visible(Fields!Value.Value,
        Lookup(Fields!RowNumber.Value + 1, Fields!RowNumber.Value, Fields!Value.Value, "DataSet11")),
    ReportItems!Textbox169.Value,
    Nothing
)
For the first row it will pass 3 and 5 to determine if 3 must be shown, then 5 and 6 to determine if the second cumulative value must be shown, and so on.
I've created another column as in your example and used the above expression. It says: if Code.Visible returns true, show Textbox169 (the cumulative value column), otherwise show Nothing.
For Running column I've used the typical running value expression:
=RunningValue(Fields!Value.Value,Sum,"DataSet11")
The result is something like the expected output shown in the question.
Let me know if this helps.

Select records based on the specific index string value and then remove subsequent fields by python

I have a .csv file named file01.csv that contains many records. Some records are required and some are not. I find that the required records contain the string "Mi", which does not exist in the unnecessary records. So I want to select the required records based on the string value "Mi" in any field of each record.
Finally, I want to delete, in each record, the fields from the one containing "Mi" onwards. Any suggestion and advice is appreciated.
Optional:
In addition, I want to delete the first column.
Split column BB into two columns named a_id and b_id: separate the value by _ (underscore); the left side goes to a_id and the right side to b_id.
My fileO.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My Expected results files (outFile.csv):
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46
The following approach should work fine using Python's csv module:
import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

# Translation tables used to strip everything except digits (Python 2)
sanitise_table = string.maketrans("", "")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def find_mi(row):
    """Return the index of the first cell containing 'Mi', or -1."""
    for index, col in enumerate(row):
        if col.find('Mi') != -1:
            return index
    return -1

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)   # keep digits only

f_input = open('fileO.csv', 'rb')
f_output = open('outFile.csv', 'wb')

csv_input = csv.reader(f_input)
csv_output = csv.writer(f_output)

input_header = next(f_input)          # skip the input header line
csv_output.writerow(output_header)

for row in csv_input:
    if len(row) >= 2:
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])
        mi = find_mi(row)
        if bb and mi != -1:
            row[:] = row[:mi] + [''] * (len(row) - mi)   # blank the 'Mi' field onwards
            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)

f_input.close()
f_output.close()
outFile.csv will contain the following:
a_id,b_id,CC,DD,EE,FF,GG
1,1,0,10,27,57,
1,3,0,10,27,,
1,6,0,10,,,
7,9,0,26,46,,
Tested using Python 2.6.6
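Note that string.maketrans and the two-argument str.translate used above are Python 2 only. Under Python 3 (an added note, not part of the original answer), the digit-stripping helper could be written as:

def sanitise_cell(cell):
    # Keep only digit characters -- a Python 3 replacement for the
    # string.maketrans/translate trick used above
    return ''.join(ch for ch in cell if ch.isdigit())

The files would then also be opened in text mode ('r' / 'w' with newline='') instead of 'rb' / 'wb'.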

Is it possible to use logarithms to convert numbers to binary?

I'm a CS freshman and I find the repeated-division method of converting a number to binary to be a pain. Is it possible to use log to quickly find, for instance, 24 in binary?
If you want to use logarithms, you can.
Define log2(b) as log(b) / log(2) or ln(b) / ln(2) (they are the same).
Repeat the following:
Define n as the integer part of log2(b). There is a 1 in the nth position of the binary representation of b.
Set b = b - 2^n.
Repeat the first step until b = 0.
Worked example: converting 2835 to binary
log2(2835) = 11.47... => n = 11
The binary representation has a 1 in the 2^11 position.
2835 - (2^11 = 2048) = 787
log2(787) = 9.62... => n = 9
The binary representation has a 1 in the 2^9 position.
787 - (2^9 = 512) = 275
log2(275) = 8.10... => n = 8
The binary representation has a 1 in the 2^8 position.
275 - (2^8 = 256) = 19
log2(19) = 4.25... => n = 4
The binary representation has a 1 in the 2^4 position.
19 - (2^4 = 16) = 3
log2(3) = 1.58... => n = 1
The binary representation has a 1 in the 2^1 position.
3 - (2^1 = 2) = 1
log2(1) = 0 => n = 0
The binary representation has a 1 in the 2^0 position.
We know the binary representation has 1s in the 2^11, 2^9, 2^8, 2^4, 2^1, and 2^0 positions:
2^      11 10  9  8  7  6  5  4  3  2  1  0
binary   1  0  1  1  0  0  0  1  0  0  1  1
so the binary representation of 2835 is 101100010011.
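Here is a minimal sketch of this method in Python (an added illustration; int(math.log2(b)) can in principle be off by one near exact powers of two due to floating point, so treat it as a sketch rather than production code):

import math

def to_binary_via_log(b):
    # Repeatedly find the highest set bit via log2 and subtract it
    bits = set()
    while b > 0:
        n = int(math.log2(b))      # integer part of log2(b)
        bits.add(n)
        b -= 2 ** n
    width = max(bits) + 1 if bits else 1
    return "".join("1" if i in bits else "0"
                   for i in range(width - 1, -1, -1))

print(to_binary_via_log(2835))     # 101100010011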
From a CS perspective, binary is quite easy because you usually only need to go up to 255, or 15 if using hex notation. The more you use it, the easier it gets.
How I do it on the fly is by remembering all the powers of 2 up to 128, down to and including 1. (The presence of the 1, instead of 1.4xxx, possibly means that you can't use logs.)
128,64,32,16,8,4,2,1
Then I go through the powers in descending order: if the number is greater than or equal to the power, that bit is a '1' and I subtract the power; otherwise it's a '0'.
So 163
163 >= 128 = '1' R 35
35 !>= 64 = '0'
35 >= 32 = '1' R 3
3 !>= 16 = '0'
3 !>= 8 = '0'
3 !>= 4 = '0'
3 >= 2 = '1' R 1
1 >= 1 = '1' R 0
163 = 10100011.
It may not be the most elegant method, but when you just need to convert something ad-hoc thinking of it as comparison and subtraction may be easier than division.
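The same compare-and-subtract idea as a short Python sketch (an added illustration, not part of the original answer):

def to_binary_compare(n, powers=(128, 64, 32, 16, 8, 4, 2, 1)):
    # Walk the powers of two in descending order, emitting '1' and
    # subtracting whenever the remaining value covers the power
    bits = ""
    for p in powers:
        if n >= p:
            bits += "1"
            n -= p
        else:
            bits += "0"
    return bits

print(to_binary_compare(163))    # 10100011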
Yes, with the division method you have to loop from 0 up to the power that is bigger than the number you need, then take the remainder and do the same again, which is a pain too.
I would suggest trying the recursive divide-and-conquer approach to division:
http://web.stanford.edu/class/archive/cs/cs161/cs161.1138/lectures/05/Small05.pdf
But again, since you need a binary representation, unless you use ready-made utilities, the division approach is the simplest one IMHO.

standard unambiguous format [R] MySQL imported data

OK, to set the scene, I have written a function to import multiple tables from MySQL (using RODBC) and run randomForest() on them.
This function is run on multiple databases (as separate instances).
In one particular database, and one particular table, the "error in as.POSIXlt.character(x, tz, ...): character string not in a standard unambiguous format" error is thrown. The function runs on around 150 tables across two databases without any issues, except for this one table.
Here is a head() print from the table:
MQLTime bar5 bar4 bar3 bar2 bar1 pat1 baXRC
1 2014-11-05 23:35:00 184 24 8 24 67 147 Flat
2 2014-11-05 23:57:00 203 184 204 67 51 147 Flat
3 2014-11-06 00:40:00 179 309 49 189 75 19 Flat
4 2014-11-06 00:46:00 28 192 60 49 152 147 Flat
5 2014-11-06 01:20:00 309 48 9 11 24 19 Flat
6 2014-11-06 01:31:00 24 177 64 152 188 19 Flat
And here is the function:
GenerateRF <- function(db, countstable, RFcutoff) {
  # Load required libraries
  library(RODBC)
  library(randomForest)
  library(caret)
  library(ff)
  library(stringi)

  # Connection and data preparation
  connection <- odbcConnect('TTODBC', uid = 'root', pwd = 'password', case = "nochange")

  # Import count table and check if RF is allowed to be built
  query.str <- paste0('select * from ', db, '.', countstable, ' order by RowCount asc')
  row.counts <- sqlQuery(connection, query.str)

  # Operate only on tables that have >= RFcutoff
  for (i in 1:nrow(row.counts)) {
    table.name <- as.character(row.counts[i, 1])
    col.count <- as.numeric(row.counts[i, 2])
    row.count <- as.numeric(row.counts[i, 3])

    if (row.count >= 20) {

      # Delete old RFs and DFs for input pattern
      if (file.exists(paste0(table.name, '_RF.Rdata'))) {
        file.remove(paste0(table.name, '_RF.Rdata'))
      }
      if (file.exists(paste0(table.name, '_DF.Rdata'))) {
        file.remove(paste0(table.name, '_DF.Rdata'))
      }

      # Import and clean data
      query.str2 <- paste0('select * from ', db, '.', table.name, ' order by mqltime asc')
      raw.data <- sqlQuery(connection, query.str2)

      # Partition data into training/test sets
      set.seed(489)
      index <- createDataPartition(raw.data$baXRC, p = 0.66, list = FALSE, times = 1)
      data.train <- raw.data[index, ]
      data.test <- raw.data[-index, ]

      # Find optimal number of trees to grow (without outcome and dates)
      data.mtry <- as.data.frame(tuneRF(data.train[, c(-1, -col.count)], data.train$baXRC, ntreetry = 100,
                                        stepFactor = .5, improve = 0.01, trace = TRUE, plot = TRUE, dobest = FALSE))
      best.mtry <- data.mtry[which(data.mtry[, 2] == min(data.mtry[, 2])), 1]

      # Compress df
      data.ff <- as.ffdf(data.train)

      # Run RF. Originally set to 1000 trees but the M1 dataset is too large for my laptop. Maybe train at the lab?
      data.rf <- randomForest(baXRC ~ ., data = data.ff[, -1], mtry = best.mtry, ntree = 500, keep.forest = TRUE,
                              importance = TRUE, proximity = FALSE)

      # Generate and print variable importance plot
      varImpPlot(data.rf, main = table.name)

      # Predict on test data
      data.test.pred <- as.data.frame(predict(data.rf, data.test, type = "prob"))

      # Get dates and name the date column
      data.test.dates <- data.frame(data.test[, 1])
      colnames(data.test.dates) <- 'MQLTime'

      # Attach dates to prediction df
      data.test.res <- cbind(data.test.dates, data.test.pred)

      # Force date coercion to attempt negating the unambiguous format error
      data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")

      # Delete row names, coerce to dataframe, generate row table name and export outcomes to MySQL
      rownames(data.test.res) <- NULL
      data.test.res <- as.data.frame(data.test.res)
      root.table <- stri_sub(table.name, 0, -5)
      sqlUpdate(connection, data.test.res, tablename = paste0(db, '.', root.table, '_outcome'), index = "MQLTime")

      # Save RF and test df/s for future use; save latest version of row_counts to MQL4 folder
      save(data.rf, file = paste0("C:/Users/user/Documents/RF_test2/", table.name, '_RF.Rdata'))
      save(data.test, file = paste0("C:/Users/user/Documents/RF_test2/", table.name, '_DF.Rdata'))
      write.table(row.counts, paste0("C:/Users/user/AppData/Roaming/MetaQuotes/Terminal/71FA4710ABEFC21F77A62A104A956F23/MQL4/Files/", db, "_m1_rowcounts.csv"), sep = ",", col.names = F,
                  row.names = F, quote = F)

    } # end of conditional block
  } # end of for loop

  # Close all connections to MySQL
  odbcCloseAll()

  # Clear workspace
  rm(list = ls())
} # end of function
At this line:
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
I have tried coercing MQLTime using various functions including: as.character(), as.POSIXct(), as.POSIXlt(), as.Date(), format(), as.character(as.Date())
and have also tried:
"%y" vs "%Y" and "%OS" vs "%S"
All variants seem to have no effect on the error and the function is still able to run on all other tables. I have checked the table manually (which contains almost 1500 rows) and also in MySQL looking for NULL dates or dates like "0000-00-00 00:00:00".
Also, if I run the function line by line in the R terminal, this offending table is processed without any problems, which just confuses the hell out of me.
I've exhausted all the functions/solutions I can think of (and also all those I could find through Dr. Google) so I am pleading for help here.
I should probably mention that the MQLTime column is stored as varchar() in MySQL. This was done to try to get around issues with type conversions between R and MySQL.
SHOW VARIABLES LIKE "%version%";
innodb_version, 5.6.19
protocol_version, 10
slave_type_conversions,
version, 5.6.19
version_comment, MySQL Community Server (GPL)
version_compile_machine, x86
version_compile_os, Win32
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
Edit: str() output on the data as imported from MySQL, showing MQLTime is already in POSIXct format:
> str(raw.data)
'data.frame': 1472 obs. of 8 variables:
$ MQLTime: POSIXct, format: "2014-11-05 23:35:00" "2014-11-05 23:57:00" "2014-11-06 00:40:00" "2014-11-06 00:46:00" ...
$ bar5 : int 184 203 179 28 309 24 156 48 309 437 ...
$ bar4 : int 24 184 309 192 48 177 48 68 60 71 ...
$ bar3 : int 8 204 49 60 9 64 68 27 192 147 ...
$ bar2 : int 24 67 189 49 11 152 27 56 437 67 ...
$ bar1 : int 67 51 75 152 24 188 56 147 71 0 ...
$ pat1 : int 147 147 19 147 19 19 147 19 147 19 ...
$ baXRC : Factor w/ 3 levels "Down","Flat",..: 2 2 2 2 2 2 2 2 2 3 ...
So I have tried declaring stringsAsFactors = FALSE in the dataframe operations, and this had no effect.
Interestingly, if the offending table is removed from processing through an additional conditional statement in the first 'if' block, the function stops on the table immediately preceding the blocked table.
If both the original and the new offending tables are removed from processing, then the function stops on the table immediately prior to them. I have never seen this sort of behaviour before and it really has me stumped.
I watched system resources during the function and they never seem to max out.
Could this be a problem with the 'for' loop and not necessarily date formats?
There appears to be some egg on my face. The table following the table where the function was stopping had a row with the value '0000-00-00 00:00:00'. I added another statement to my MySQL function to remove these rows when pre-processing the tables. Thanks to those who had a look at this.