How do I chain Python generator expressions with filtering?

I am trying to chain generators so I can process a large CSV file as a stream of lines rather than doing each operation in a batch.
This way, I can delay each iteration of each step, avoiding loading the entire data set in memory.
Generator expressions work fine except when I try to filter the output with an if clause.
This works, and only one item is consumed at a time:
file_iterable = open("myfile.csv")
parsed_csv_iterable = (parse_i(i) for i in file_iterable)
Then I can get a line at a time by calling next() on the resulting iterable. However, if I do this:
file_iterable = open("myfile.csv")
parsed_csv_iterable = (parse_i(i) for i in file_iterable if parse_i(i)[0] in [1,2,3])
Then the iterator keeps running until it is exhausted. Why? What's the workaround?
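One workaround (a sketch, assuming parse_i is a pure function): chain a second generator expression so each line is parsed once and the filter stays lazy. Note that next() on a filtered generator still consumes source lines until one passes the filter, so a filter that never matches will read to the end of the file, which may be what you are seeing.
file_iterable = open("myfile.csv")
parsed_csv_iterable = (parse_i(i) for i in file_iterable)  # parse each line once
filtered_iterable = (row for row in parsed_csv_iterable if row[0] in [1, 2, 3])
first_row = next(filtered_iterable)  # reads lines only until the first match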

Related

Read every nth batch in pyarrow.dataset.Dataset

In PyArrow you can now do:
import pyarrow.dataset as ds

a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return every Nth batch instead of every batch? Seems like this could be something in FragmentScanOptions, but that's not documented at all.
No, there is no way to do that today. I'm not sure what you're after, but if you are trying to sample your data there are a few choices, though none achieves quite this effect.
To load only a fraction of your data from disk you can use pyarrow.dataset.Dataset.head.
There is a request in place for randomly sampling a dataset although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
Update: If your dataset is only parquet files then there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
sampled_fragments = all_fragments[::2]
# Have to construct the sample dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()
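To take every Nth row group instead of every other one, the same slicing generalizes (n here is a hypothetical stride):
n = 5
sampled_fragments = all_fragments[::n]
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
for batch in sampled_dataset.to_batches():
    ...  # only batches from every Nth row group arrive here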

How can I make a select function more performant in pyspark?

When I use the following function, it takes up to 10 seconds to execute. Is there any way to make it run quicker?
import pyspark.sql.functions as f

def select_top_20(df, col):
    most_data = df.groupBy(col).count().sort(f.desc("count"))
    top_20_count = most_data.limit(20).drop("count")
    top_20 = [row[col] for row in top_20_count.collect()]
    return top_20
Hard to answer in general; the code itself seems fine to me.
It depends on how the input DataFrame was created:
if it was read directly from a data source (parquet, a database, or similar), it is an I/O problem and there is not much you can do.
if the DataFrame went through some processing before the function is executed, inspect that part. Lazy evaluation in Spark means that all of this processing is redone from scratch each time you execute this function (not just the commands listed in the function): reading the data from disk, processing, everything. Persisting or caching the DataFrame somewhere in between might speed things up considerably, as sketched below.
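A minimal sketch of the caching suggestion, assuming df is the result of earlier transformations and some_col is a placeholder column name:
df = df.cache()                      # mark the DataFrame for caching
df.count()                           # force materialization once
top = select_top_20(df, "some_col")  # this and later actions reuse the cached data
df.unpersist()                       # release the cache when finished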

Save function results when a script is executed

Pretty new to Python. I have a machine learning script, and every time the script is run I would like to save the results. What I don't understand is: if all the code is in one script, how do I save the results without overwriting the previous ones? For example:
auc_score = cross_val_score(logreg_model, X_RFECV, y_vars, cv=kf, scoring='roc_auc').mean()
auc_scores = []
def auc_log():
    auc_scores.append(auc_score)
    return auc_scores
auc_log()
Every time I run this .py file, the auc_scores list starts out blank; it only grows while the script is running, and the next run of the whole script resets it to empty again. I feel this is fairly simple, I'm just not thinking about it properly from a continuous deployment perspective. Thanks!
One option is to pass the previous result list (or an initial empty list) into the acu_log function as arguments, so the function can accumulate all of the results.
For example,
auc_score = cross_val_score(logreg_model, X_RFECV, y_vars, cv=kf, scoring='roc_auc').mean()
# if auc_score is an 'int' or 'float', you must convert it to a list
auc_score_ = []
auc_score_.append(auc_score)
auc_score_zero = []
def acu_log(acu_score_1, auc_score_2):
    acu_scores = acu_score_1 + auc_score_2
    return acu_scores
initial_log = acu_log(auc_score_zero, auc_score_)
# print(initial_log)
second_log = acu_log(initial_log, auc_score_)
# print(second_log)
If you want to save each acu_log list to disk after returning the result at each step, the pickle module is convenient for that.
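A minimal sketch of that approach (the file name scores.pkl is just an example): load any previously saved scores at startup, append the new one, and write the list back, so each run of the script extends the history instead of resetting it.
import os
import pickle

SCORE_FILE = "scores.pkl"

# load previous scores if the file exists, otherwise start fresh
if os.path.exists(SCORE_FILE):
    with open(SCORE_FILE, "rb") as fh:
        auc_scores = pickle.load(fh)
else:
    auc_scores = []

auc_scores.append(auc_score)  # auc_score computed as above

with open(SCORE_FILE, "wb") as fh:
    pickle.dump(auc_scores, fh)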
I'm not sure this is really what you want, but I hope my answer contributes to solving your question.

Octave: how to retrieve data from a Java ResultSet object?

I need to feed my Octave instance with data retrieved from an Oracle database.
I have implemented an OJDBC connection in my Octave instance and I am now able to put data from an Oracle database into a Java ResultSet object in Octave (taken from: https://lists.gnu.org/archive/html/help-octave/2011-08/msg00250.html):
javaaddpath('access-path-to-ojdbc8.jar') ;
props = javaObject('java.util.Properties') ;
props.setProperty("user", 'username') ;
props.setProperty("password", 'password') ;
driver = javaObject('oracle.jdbc.OracleDriver') ;
url = 'jdbc:oracle:thin:#ip:port:schema' ;
con = driver.connect(url, props) ;
sql = 'select-query' ;
ps = con.prepareStatement(sql) ;
rs = ps.executeQuery() ;
But I haven't succeeded in retrieving data from that ResultSet.
How can I put data from a ResultSet object in Octave into an array or matrix?
Finding out what to do
The docs you want for ResultSet and related classes are in the Java JDBC API documentation. (You don't need the Oracle-specific doco unless you want to do fancy Oracle-specific stuff. All JDBC drivers conform to the generic JDBC API.) Have a look at that and any JDBC tutorial; because it is a Java object, you'll use all the same method calls from Octave that you would from Java code.
For conversion to Octave values, know that Java primitives convert to Octave types automatically, java.lang.String objects require conversion by calling char(...) on them, and java.sql.Date values you will have to convert to datenums manually. (Lazy way is to get their string values and parse them; fast way is to get their Unix time values and convert numerically.)
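A sketch of both date conversions for a java.sql.Date value d (time-zone subtleties glossed over):
% lazy way: java.sql.Date.toString() yields 'yyyy-mm-dd'
dn = datenum(char(d.toString()), 'yyyy-mm-dd');
% fast way: getTime() is milliseconds since the Unix epoch
dn = d.getTime() / 86400000 + datenum(1970, 1, 1);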
What to do
Because Java JDBC advances the result set cursor one row at a time, and requires a separate method call to get the value for each column, you need to use a pair of nested loops to iterate over the ResultSet. Like this:
rsMeta = rs.getMetaData();
nCols = rsMeta.getColumnCount();
data = NaN(1, nCols);
iRow = 0;
while rs.next()
  iRow = iRow + 1;
  for iCol = 1:nCols
    data(iRow,iCol) = rs.getDouble(iCol);
  endfor
endwhile
Ah, but what if your columns aren't all numerics? Then you'll need to look at the column type in rsMeta, switch on it, and use a cell array to hold the heterogeneous data set. Like this:
rsMeta = rs.getMetaData();
nCols = rsMeta.getColumnCount();
data = cell(1, nCols);
iRow = 0;
while rs.next()
  iRow = iRow + 1;
  for iCol = 1:nCols
    colTypeId = rsMeta.getColumnType(iCol);
    switch colTypeId
      case NUMERIC_TYPE
        data{iRow,iCol} = rs.getDouble(iCol);
      case CHAR_TYPE
        data{iRow,iCol} = rs.getString(iCol);
        data{iRow,iCol} = char(data{iRow,iCol});
      # ... and so on ...
      otherwise
        error('Unsupported SQL data type in column %d: %d', ...
              iCol, colTypeId);
    endswitch
  endfor
endwhile
How do you know what the values for NUMERIC_TYPE, CHAR_TYPE, and so on should be? You have to examine the values in the java.sql.Types Java class. Do that at run time to make sure you're consistent with the JDK you're running against.
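One way to pick those up at run time from Octave (a sketch, using Octave's java_get to read the static fields of java.sql.Types):
% read the JDBC type codes from the running JDK so they always match
NUMERIC_TYPE = java_get('java.sql.Types', 'NUMERIC');
CHAR_TYPE    = java_get('java.sql.Types', 'CHAR');
VARCHAR_TYPE = java_get('java.sql.Types', 'VARCHAR');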
(Note: this code is the easy, sloppy way of doing it. There's all sorts of improvements and optimizations you could (and should) do on it.)
How to go fast
Unfortunately, the performance of this is going to suck big time, because Java method calls from Octave are expensive, and cells are an inefficient way of holding data. If your result sets are large, then to get good performance you need to write a result set buffering layer in Java that runs the loops in Java and buffers the results in primitive per-column arrays, and use that. If you want an example of how to do this, I have an example implementation in Matlab in my Janklab library (M-code layer here). Feel free to steal the code. Octave doesn't support dot-referencing of Java constructors or class methods, so to convert it to Octave, you'd need to replace all those with javaObject and javaMethod calls. (That's tedious and results in ugly code, so I'm not going to do it myself. Sorry.)
If you're not willing to do that (and really, who is?), and still need good performance, what you should actually do is forget about connecting Octave directly to Oracle, and write a separate Python/NumPy or R program that takes your query, runs it against your Oracle db, and writes the result to a .mat file that you will then read from Octave.
I don't have access to the specified .jar or a suitable database to test your specific code, but in any case, this isn't really an Octave problem. Effectively you need the relevant API for the ResultSet class, and a standard approach for processing it. The Oracle documentation suggests that in Java you'd do something like this:
while (rs.next()) { System.out.println (rs.getString(1)); }
So, presumably this is exactly what you'll do in Octave too, except via Octave's Java interface. One possible way this might look is:
while rs.next().booleanValue  % since a Boolean java object by itself
                              % isn't valid logical input for octave's
                              % 'while' statement
  % do something with rs, e.g. fill in a cell array
endwhile
As for whether you can automatically convert a Java array to an Octave cell object or vice versa, as far as I know this is not possible. You'd have to set / get elements from one to the other via a for loop, just like you'd do in Java (e.g. see the note in the manual regarding the javaArray function).

How can I set an expression to the FileSpec property on Foreach File enumerator?

I'm trying to create an SSIS package to process files from a directory that contains many years worth of files. The files are all named numerically, so to save processing everything, I want to pass SSIS a minimum number, and only enumerate files whose name (converted to a number) is higher than my minimum.
I've tried letting the ForEach File loop enumerate everything and then exclude files in a Script Task, but when dealing with hundreds of thousands of files, this is way too slow to be suitable.
The FileSpec property lets you specify a file mask to dictate which files you want in the collection, but I can't quite see how to specify an expression to make that work, as it's essentially a string match.
If there's an expression within the component somewhere which basically says Should I Enumerate? - Yes / No, that would be perfect. I've been experimenting with the below expression, but can't find a property to which to apply it.
(DT_I4)REPLACE( SUBSTRING(@[User::ActiveFilePath], FINDSTRING( @[User::ActiveFilePath], "\\", 7 ) + 1, 100), ".txt", "") > @[User::MinIndexId] ? "True" : "False"
Here is one way you can achieve this. You could use Expression Task combined with Foreach Loop Container to match the numerical values of the file names. Here is an example that illustrates how to do this. The sample uses SSIS 2012.
This may not be very efficient but it is one way of doing this.
Let's assume there is a folder with a bunch of files named in the format YYYYMMDD. The folder contains files for the first day of every month since 1921, like 19210101, 19210201, 19210301, all the way up to the current month, 20121101. That adds up to 1,103 files.
Let's say the requirement is only to loop through the files that were created since June 1948. That would mean the SSIS package has to loop through only the files greater than 19480601.
On the SSIS package, create the following three parameters. It is better to use parameters for these because the values are configurable across environments.
ExtensionToMatch - This parameter of String data type will contain the extension that the package has to loop through. It supplements the FileSpec variable that will be used on the Foreach Loop container.
FolderToEnumerate - This parameter of String data type will store the folder path that contains the files to loop through.
MinIndexId - This parameter of Int32 data type will contain the minimum numerical value above which the files should match the pattern.
Create the following four variables that will help us loop through the files.
ActiveFilePath - This variable of String data type will hold the file name as the Foreach Loop container loops through each file in the folder. This variable is used in the expression of another variable. To avoid error, set it to a non-empty value, say 1.
FileCount - This is a dummy variable of Int32 data type will be used for this sample to illustrate the number of files that the Foreach Loop container will loop through.
FileSpec - This variable of String data type will hold the file pattern to loop through. Set the expression of this variable to the value below. The expression uses the extension specified in the parameters; if no extension is given, it falls back to *.* to loop through all files.
"*" + (@[$Package::ExtensionToMatch] == "" ? ".*" : @[$Package::ExtensionToMatch])
ProcessThisFile - This variable of Boolean data type will evaluate whether a particular file matches the criteria or not.
Configure the package as shown below. Foreach loop container will loop through all the files matching the pattern specified on the FileSpec variable. An expression specified on the Expression Task will evaluate during runtime and will populate the variable ProcessThisFile. The variable will then be used on the Precedence constraint to determine whether to process the file or not.
The script task within the Foreach loop container will increment the counter of variable FileCount by 1 for each file that successfully matches the expression.
The script task outside the Foreach loop will simply display how many files were looped through by the Foreach loop container.
Configure the Foreach loop container to loop through the folder using the parameter and the files using the variable.
Store the file name in variable ActiveFilePath as the loop passes through each file.
On the Expression task, set the expression to the following value. The expression converts the file name without the extension to a number and then checks whether it is greater than the number given in the parameter MinIndexId.
@[User::ProcessThisFile] = (DT_BOOL)((DT_I4)(REPLACE(@[User::ActiveFilePath], @[User::FileSpec], "")) > @[$Package::MinIndexId] ? 1 : 0)
Right-click on the Precedence constraint and configure it to use the variable ProcessThisFile on the expression. This tells the package to process the file only if it matches the condition set on the expression task.
@[User::ProcessThisFile]
On the first script task, I have the variable User::FileCount set in the ReadWriteVariables and the following C# code within the script task. This increments the counter for each file that successfully matches the condition.
public void Main()
{
    Dts.Variables["User::FileCount"].Value = Convert.ToInt32(Dts.Variables["User::FileCount"].Value) + 1;
    Dts.TaskResult = (int)ScriptResults.Success;
}
On the second script task, I have the variable User::FileCount set in the ReadOnlyVariables and the following C# code within the script task. This simply outputs the total number of files that were processed.
public void Main()
{
    MessageBox.Show(String.Format("Total files looped through: {0}", Dts.Variables["User::FileCount"].Value));
    Dts.TaskResult = (int)ScriptResults.Success;
}
When the package is executed with MinIndexId set to 19480601 (that value itself excluded), it outputs the value 773.
When the package is executed with MinIndexId set to 20111201 (that value itself excluded), it outputs the value 11.
Hope that helps.
From investigating how the ForEach loop works in SSIS (with a view to creating my own to solve the issue), it seems that the way it works (as far as I could see, anyway) is to enumerate the file collection first, before any mask is applied. It's hard to tell exactly what's going on without seeing the underlying code for the ForEach loop, but it seems to be doing it this way, resulting in slow performance when dealing with over 100k files.
While @Siva's solution is fantastically detailed and definitely an improvement over my initial approach, it is essentially the same process, except using an Expression Task to test the filename rather than a Script Task (this does seem to offer some improvement).
So, I decided to take a totally different approach and rather than use a file-based ForEach loop, enumerate the collection myself in a Script Task, apply my filtering logic, and then iterate over the remaining results. This is what I did:
In my Script Task, I use the DirectoryInfo.EnumerateFiles method, which is the recommended approach for large file collections because it streams results as they are found, rather than waiting for the entire collection to be built before applying any logic.
Here's the code:
public void Main()
{
    string sourceDir = Dts.Variables["SourceDirectory"].Value.ToString();
    int minJobId = (int)Dts.Variables["MinIndexId"].Value;

    // Enumerate the file collection (EnumerateFiles lets us start processing immediately)
    List<string> activeFiles = new List<string>();
    System.Threading.Tasks.Task listTask = System.Threading.Tasks.Task.Factory.StartNew(() =>
    {
        DirectoryInfo dir = new DirectoryInfo(sourceDir);
        foreach (FileInfo f in dir.EnumerateFiles("*.txt"))
        {
            FileInfo file = f;
            string filePath = file.FullName;
            string fileName = filePath.Substring(filePath.LastIndexOf("\\") + 1);
            int jobId = Convert.ToInt32(fileName.Substring(0, fileName.IndexOf(".txt")));
            if (jobId > minJobId)
                activeFiles.Add(filePath);
        }
    });

    // Wait here for completion
    System.Threading.Tasks.Task.WaitAll(new System.Threading.Tasks.Task[] { listTask });

    Dts.Variables["ActiveFilenames"].Value = activeFiles;
    Dts.TaskResult = (int)ScriptResults.Success;
}
So, I enumerate the collection, applying my logic as files are discovered and immediately adding the file path to my list for output. Once complete, I then assign this to an SSIS Object variable named ActiveFilenames which I'll use as the collection for my ForEach loop.
I configured the ForEach loop as a ForEach From Variable Enumerator, which now iterates over a much smaller collection (a post-filtered List<string>, compared to what I can only assume was an unfiltered List<FileInfo> or something similar in SSIS's built-in ForEach File Enumerator).
So the tasks inside my loop can be dedicated purely to processing the data, since it has already been filtered before hitting the loop. Although it doesn't seem to do much differently from either my initial package or Siva's example, in production (for this particular case, anyway) filtering the collection and enumerating it as a stream provides a massive boost over using the built-in ForEach File Enumerator.
I'm going to continue investigating the ForEach loop container and see if I can replicate this logic in a custom component. If I get this working I'll post a link in the comments.
The best you can do is use FileSpec to specify a mask, as you said. You could include at least some specifics in it, like files starting with "201" for 2010, 2011 and 2012. Then, in some other task, you could filter out the ones you don't want to process (for instance, 2010).
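A sketch of that idea (the "201" prefix is just an example): an expression on the collection's FileSpec property can build the mask, e.g.
"201" + "*.txt"
so only the candidate files are enumerated at all, and a lightweight task inside the loop discards the small remainder.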