How to use pyarrow record batch to communicate across different processes or even different hosts - pyarrow

I read about RecordBatch in pyarrow and am very interested (I am a novice in the pyarrow world). I am wondering if I could use it to communicate between two different processes or different hosts. I am confused because all the examples/documentation I could find online are use cases within the same process, e.g. https://arrow.apache.org/docs/python/ipc.html and https://wesmckinney.com/blog/arrow-streaming-columnar/ Basically, they are doing this:
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()
source = sink.getvalue()
reader = pa.ipc.open_stream(source)
# reader do job...
Two questions:
If I have two different processes or two different hosts, one writing and the other reading, how do I pass the "source" or some kind of handle? I tried with a vanilla file but it doesn't seem to work. Maybe I am doing something wrong. For example,
writer = pa.ipc.new_file('tmp.record_batch', batch.schema)
writer.write_batch(batch)
# works so far, but next:
reader = pa.ipc.open_file('tmp.record_batch')
ArrowInvalid: Not an Arrow file
Is there a SWMR (single-write-multiple-non-blocking-reader) mode?
Thanks in advance!

The examples are demonstrating in-process for simplicity, but they are working with the Arrow IPC format, which you can send between processes or machines. Think of this in terms of a Parquet, Avro, or CSV file - you can write this out in one process, then read it in another process. It's up to you how these processes communicate otherwise.
For your attempt, you need to close the file to write out the footer before you can open it in another process. You could also try the stream format which doesn't depend on a footer and should be readable incrementally (assuming you flush the file each time). This may even work over something like a domain socket depending on what you're trying to accomplish.
However, it sounds like you're interested in something more like a client-server setup? In that case, Arrow Flight may be more appropriate, especially once you have two different hosts. Flight is an RPC framework specialized for Arrow data transfer.

Related

searching in html/txt without loading it into program [duplicate]

I have a FindFile routine in my program which will list files, but if the "Containing Text" field is filled in, then it should only list files containing that text.
If the "Containing Text" field is entered, then I search each file found for the text. My current method of doing that is:
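var
  FileContents: TStringList;
begin
  // The list must be created before use and freed afterwards.
  FileContents := TStringList.Create;
  try
    FileContents.LoadFromFile(Filepath);
    Found := Pos(TextToFind, FileContents.Text) > 0;
  finally
    FileContents.Free;
  end;
end;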
The above code is simple, and it generally works okay. But it has two problems:
It fails for very large files (e.g. 300 MB)
I feel it could be faster. It isn't bad, but why wait 10 minutes searching through 1000 files, if there might be a simple way to speed it up a bit?
I need this to work for Delphi 2009 and to search text files that may or may not be Unicode. It only needs to work for text files.
So how can I speed this search up and also make it work for very large files?
Bonus: I would also want to allow an "ignore case" option. That's a tougher one to make efficient. Any ideas?
Solution:
Well, mghie pointed out my earlier question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and as I answered, it was different and didn't provide the solution.
But he got me thinking that I had done this before and I had. I built a block reading routine for large files that breaks it into 32 MB blocks. I use that to read the input file of my program which can be huge. The routine works fine and fast. So step one is to do the same for these files I am looking through.
So now the question was how to efficiently search within those blocks. Well I did have a previous question on that topic: Is There An Efficient Whole Word Search Function in Delphi? and RRUZ pointed out the SearchBuf routine to me.
That solves the "bonus" as well, because SearchBuf has options which include Whole Word Search (the answer to that question) and MatchCase/noMatchCase (the answer to the bonus).
So I'm off and running. Thanks once again SO community.
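For readers who want the shape of the block-reading idea without the Delphi specifics, here is a rough sketch in Python. It is not the routine described above: stream_contains and its parameters are hypothetical names, and the ignore-case branch relies on bytes.lower(), which only folds ASCII letters, so it is an assumption that this is good enough for the data at hand.

```python
def stream_contains(path, needle, ignore_case=False, block_size=32 * 1024 * 1024):
    """Return True if the bytes `needle` occur in the file, reading block by block."""
    if not needle:
        return True
    if ignore_case:
        needle = needle.lower()  # ASCII-only case folding
    keep = len(needle) - 1  # bytes to carry over between blocks
    tail = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                return False  # end of file, nothing found
            if ignore_case:
                block = block.lower()
            data = tail + block
            if needle in data:
                return True  # stop at the first match
            # Keep the end of this block so a match split across two
            # blocks is still found on the next iteration.
            tail = data[-keep:] if keep else b""
```

Only one block plus a few carried-over bytes is in memory at a time, so file size stops being a problem, and the early return avoids reading the rest of a file once a match is found.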
The best approach here is probably to use memory mapped files.
First you need a file handle, use the CreateFile windows API function for that.
Then pass that to CreateFileMapping to get a file mapping handle. Finally use MapViewOfFile to map the file into memory.
To handle large files, MapViewOfFile is able to map only a certain range into memory, so you can e.g. map the first 32MB, then use UnmapViewOfFile to unmap it followed by a MapViewOfFile for the next 32MB and so on. (EDIT: as was pointed out below, the offset of each view must be a multiple of the system allocation granularity (typically 64 KB on Windows; see GetSystemInfo), and the blocks you map this way must overlap by at least the length of the text you are searching for, so that you are not overlooking any text which might be split at the block boundary.)
To do the actual searching once (part of) the file is mapped into memory, you can make a copy of the source for StrPosLen from SysUtils.pas (it's unfortunately defined in the implementation section only and not exposed in the interface). Leave one copy as is and make another copy, replacing Wide with Ansi every time. Also, if you want to be able to search in binary files which might contain embedded #0's, you can remove the (Str1[I] <> #0) and part.
Either find a way to identify if a file is ANSI or Unicode, or simply call both the Ansi and Unicode version on each mapped part of the file.
Once you are done with each file, make sure to call CloseHandle first on the file mapping handle and then on the file handle. (And don't forget to call UnmapViewOfFile first.)
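The windowed mapping described above can be sketched in Python, whose mmap module wraps the same OS facility. This is only an illustration of the window-plus-overlap logic, not the Win32 calls themselves; file_contains and the window parameter are hypothetical names.

```python
import mmap
import os


def file_contains(path, needle, window=32 * 1024 * 1024):
    """Return True if the bytes `needle` occur in the file, mapping it in windows."""
    size = os.path.getsize(path)
    if not needle or size < len(needle):
        return False
    # View offsets must be multiples of the OS allocation granularity.
    gran = mmap.ALLOCATIONGRANULARITY
    step = max(gran, window - window % gran)
    with open(path, "rb") as f:
        for offset in range(0, size, step):
            # Map a little past the window end so that a match straddling
            # two windows is fully visible in the earlier one.
            length = min(step + len(needle) - 1, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as view:
                if view.find(needle) != -1:
                    return True
    return False
```

Each view is unmapped when its `with` block exits, which mirrors the UnmapViewOfFile step between windows.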
EDIT:
A big advantage of using memory mapped files instead of using e.g. a TFileStream to read the file into memory in blocks is that the bytes will only end up in memory once.
Normally, on file access, Windows first reads the bytes into the OS file cache, then copies them from there into the application memory.
If you use memory mapped files, the OS can directly map the physical pages from the OS file cache into the address space of the application without making another copy (reducing the time needed for making the copy and halving memory usage).
Bonus Answer: By calling StrLIComp instead of StrLComp you can do a case insensitive search.
If you are looking for text string searches, look at the Boyer-Moore search algorithm. Combined with memory mapped files, it makes for a really fast search engine. There are some Delphi units around that contain implementations of this algorithm.
To give you an idea of the speed: I currently search through 10-20MB files and it takes on the order of milliseconds.
Oh, I just read that the files might be Unicode. I'm not sure whether those units support that, but definitely look down this path.
This is a problem connected with your previous question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and the same answers apply. If you don't read the files completely but in blocks then large files won't pose a problem. There's also a big speed-up to be had for files containing the text, in that you should cancel the search upon the first match. Currently you read the whole files even when the text to be found is in the first few lines.
May I suggest a component? If so, I would recommend ATStreamSearch.
It handles ANSI and UNICODE (and even EBCDIC and Korean and more).
Or the class TUTBMSearch from JclUnicode (Jedi JCL). It was mainly written by Mike Lischke (VirtualTreeview). It uses a tuned Boyer-Moore algorithm that ensures speed. The downside in your case is that it works entirely in Unicode (WideStrings), so the conversion from String to WideString risks being costly.
It depends on what kind of data you are going to search. To get really efficient results, let your program parse the directories of interest, including all files in them, and keep the data in a database. You can then query it for a specific word in a specific list of files, which can be generated from the search path. A database query can return results in milliseconds.
The issue is that you will have to let it run and parse all files after installation, which may take more than an hour depending on the amount of data you wish to parse.
This database should be updated each time your program starts. This can be done by comparing the MD5 value of each file to see whether it has changed, so you don't have to parse all your files each time.
This way of working can be interesting if all your data is in a fixed place and you mostly analyse the same files repeatedly rather than totally new files each time. Some code analysers work like this and they are really efficient. You invest some time in parsing and saving the interesting data, and then you can jump to the exact place where a search word appears and list all the places it appears in a very short time.
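The MD5-based update check could look roughly like this, sketched in Python. The names file_md5 and changed_files are hypothetical, and a plain dict stands in for wherever the hashes would really be stored (for example a database table).

```python
import hashlib


def file_md5(path, block_size=1024 * 1024):
    """Hash a file block by block so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(paths, known_hashes):
    """Return the paths whose MD5 differs from `known_hashes`, updating it in place."""
    changed = []
    for path in paths:
        current = file_md5(path)
        if known_hashes.get(path) != current:
            known_hashes[path] = current
            changed.append(path)  # only these need to be re-parsed
    return changed
```

Note that hashing still reads every file once per start-up; comparing size and modification time first would be a cheaper pre-filter, at the cost of missing some edits.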
If the files are to be searched multiple times, it could be a good idea to use a word index.
This is called "Full Text Search".
It will be slower the first time (text must be parsed and indexes must be created), but any future search will be immediate: in short, it will use only the indexes, and not read all text again.
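As a toy illustration of what such a word index does (not any of the engines named in this answer), here is a minimal in-memory version in Python. build_index and search are hypothetical names, and splitting on \w+ is a simplifying assumption about what counts as a word.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to the set of document names containing it.

    `docs` is a dict of {name: text}; building it is the slow, one-time step.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents containing every word of `query`."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Once the index exists, a search touches only the index entries for the query words and never rereads the documents, which is where the speed comes from.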
You have the exact parser you need in The Delphi Magazine Issue 78, February 2002:
"Algorithms Alfresco: Ask A Thousand Times
Julian Bucknall discusses word indexing and document searches: if you want to know how Google works its magic this is the page to turn to."
There are several FTS implementations for Delphi:
Rubicon
Mutis
ColiGet
Google is your friend.
I'd like to add that most databases have an embedded FTS engine. SQLite3 even has a very small but efficient implementation, with page ranking and such.
We provide direct access from Delphi, with ORM classes, to this Full Text Search engine, named FTS3/FTS4.

Why can't I read a JSON file with a different program while my Processing sketch is still open?

I'm writing data to a JSON file in Processing with the saveJSONObject command. I would like to access that JSON file with another program (MAX/MSP) while my sketch is still open. The problem is, MAX is unable to read from the file while my sketch is running. Only after I close the sketch is MAX able to import data from my file.
Is Processing keeping that file open somehow while the sketch is running? Is there any way I can get around this problem?
It might be easier to stream your data straight to MaxMSP using the OSC protocol. On the Processing side, have a look at the oscP5 library and on the Max side at the udpreceive object.
You could send your JSON object as a string and unpack that in Max (maybe using the JavaScript support already present in Max), but it might be simpler to mimic the structure of your JSON object as the arguments of the OSC message object which you simply unpack in Max directly.
Probably, because I/O is usually buffered (notably for performance reasons, and also because the hardware is doing I/O by blocks).
Try to flush the output channel, perhaps using PrintWriter::flush or something similar.
Details are implementation specific (and might be operating system specific).
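The general idea, shown here in Python rather than Processing, is to make sure the bytes have actually left the writer's buffer before another program looks at the file. This is a sketch of the technique only: save_json_visible is a hypothetical helper, and whether saveJSONObject itself leaves anything buffered or open is not something this example establishes.

```python
import json
import os


def save_json_visible(path, obj):
    """Write JSON so that another process can read it while we keep running."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(obj, f)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to disk
    # The file is closed here; swap it into place in one step so a reader
    # never sees a half-written file.
    os.replace(tmp_path, path)
```

Writing to a temporary name and renaming is an extra precaution beyond flushing; plain flush-then-close is often enough when the reader only polls occasionally.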

HTML5: accessing large structured local data

Summary:
Are there good HTML5/javascript options for selectively reading chunks of data (let's say to be eventually converted to JSON) from a large local file?
Problem I am trying to solve:
Some existing program runs locally and outputs a ton of data. I want to provide a browser-based interactive viewer that will allow folks to browse through these results. I have control over how the data is written out. I can write it all out in one big file, but since it's quite large, I can't just read the whole thing in memory. Hence, I am looking for some kind of indexed or db-like access to this from my webapp.
Thoughts on solutions:
1. Brute-force: HTML5 FileReader API has a nice slice() method for random access. So I could write out some kind of an index in the beginning of the file, use it to look up positions of other stored objects, and read them whenever they're needed. I figured I'd ask if there are already javascript libraries that do something like this (or better) before trying to implement this ugly thing.
2. HTML5 local database. Essentially, I am looking for an analog of HTML5 openDatabase() call that would open (a read-only) connection to a database based on a user-specified local file. From what I understand, there's no way to specify a file with a pre-loaded database. Furthermore, even if there was such a hack, it's not clear whether the local file format would be the same across browsers. I've seen the phonegap solution that populates the browser local database from SQL statements. I can do that too, but the data I am talking about is quite large (5-10GB): it will take a while to load, and such duplication seems rather pointless.
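To make option 1 concrete, here is one possible layout for such an indexed file, sketched in Python since the writing side is under the asker's control. Everything about the format is an assumption made for illustration: an 8-byte little-endian index length, then a JSON index of {key: [offset, length]}, then the JSON-encoded records. The names write_indexed and read_record are hypothetical.

```python
import json
import struct


def write_indexed(path, records):
    """Write `records` ({key: JSON-serialisable value}) with an index up front."""
    blobs = {key: json.dumps(value).encode("utf-8") for key, value in records.items()}
    index, pos = {}, 0
    for key, blob in blobs.items():
        index[key] = [pos, len(blob)]  # offset is relative to the data section
        pos += len(blob)
    index_bytes = json.dumps(index).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(index_bytes)))
        f.write(index_bytes)
        for blob in blobs.values():
            f.write(blob)


def read_record(path, key):
    """Read one record by seeking to it; the rest of the file is never loaded."""
    with open(path, "rb") as f:
        (index_len,) = struct.unpack("<Q", f.read(8))
        index = json.loads(f.read(index_len))
        offset, length = index[key]
        f.seek(8 + index_len + offset)
        return json.loads(f.read(length))
```

A browser-side reader would do the same three steps with slice(): read the 8-byte length, read the index, then slice out just the record it needs. For 5-10GB of data the index itself may need to be split or tiered.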
HTML5 does not sound like the appropriate answer for your needs. HTML5's focus is on the client side, and based on your description you're asking a lot out of the browsers, most likely more than they can handle.
I would instead recommend you look at a server-based solution to deliver the desired goal/results to the client view, something like Splunk would be a good product to consider.

best practices for writing to a file from multiple methods

I have a class that contains a bunch of methods for checking data I scrape every week (for things like well-formedness and other errors in gathering the data). Each of these methods performs a test, and then prints out a summary of the test.
I want to print out the output from these tests to a file, but I'm not sure what the best way to do it is. For example...
Should the class hold an instance variable to the file, and each method open/appends/closes the file? (A problem is that methods sometimes call other methods, so this seems kinda messy?)
Should each method get passed the file as a parameter? (Seems messy as well.)
Should each method return a string, and a "central" method that calls all the other tests outputs all these strings to a file?
I'm not really familiar with using logger libraries -- would that be a solution?
My particular context
I have a scraper that pulls data from various websites and stores them in a database. Websites change all the time, so I'm writing a "scrape checker" program that checks my scrapes for various things, like:
number of empty results
length of results
weird characters in results
and so on
So I have methods like:
check_num_empty_results
check_weird_characters
check_scrape (calls a bunch of other checks)
check_scrape_pair (sometimes I want to check pairs of scrapes together, e.g., to match results against each other, so this is different from checking each one in isolation)
etc.
I want my "scrape checker" program to print out a file that summarizes all the checks.
Separation of concerns. Write code that focuses on the scraping activity and returns the value(s) scraped. Then use aspect-oriented programming for logging, which can simplify the problem greatly as the aspect holds the reference to the file or logging API.
Ultimately, it depends on what language you're using.
The first solution makes the most sense if your language permits it. For each instance of the logging class, have a field for the file object that you're reading from/writing to. This is basically equivalent to passing the file object as a parameter to every method.
That said, most mature languages have modules that will do a lot of this work for you; off the top of my head, sh/awk, Perl, and Python all come to mind as being suited to this task (though if you want to, you could use Java or something else).
Seems like a logging framework would be a perfect solution for this. If you are using Java or .NET, log4j and log4net are pretty much the de-facto standards for that.
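The question doesn't say which language is in use, so as an illustration only, here is how the logging approach might look with Python's standard logging module. The check names are borrowed from the question; make_report_logger, ScrapeChecker and the message wording are hypothetical.

```python
import logging


def make_report_logger(path):
    """Create a logger that writes one plain line per message to `path`."""
    logger = logging.getLogger("scrape_checker")
    logger.setLevel(logging.INFO)
    for old in list(logger.handlers):  # avoid duplicate output on re-creation
        logger.removeHandler(old)
        old.close()
    handler = logging.FileHandler(path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


class ScrapeChecker:
    """Each check logs its own summary; no method opens or passes a file."""

    def __init__(self, logger):
        self.log = logger

    def check_num_empty_results(self, results):
        empty = sum(1 for r in results if not r)
        self.log.info("empty results: %d of %d", empty, len(results))
        return empty

    def check_weird_characters(self, results):
        weird = sum(1 for r in results if any(not ch.isprintable() for ch in r))
        self.log.info("results with weird characters: %d of %d", weird, len(results))
        return weird

    def check_scrape(self, results):
        # Checks can call other checks freely; output order follows call order.
        self.check_num_empty_results(results)
        self.check_weird_characters(results)
        self.log.info("scrape check finished")
```

Because the logger owns the file, methods calling other methods stops being messy, and the destination can later be switched (console, e-mail, several files) without touching the checks.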

Testing and mocking with Flex

I am developing a "dumb" front-end, it's an AIR application that interacts with a "smart" LiveCycle server. There are currently about 20 request & response pairs for the application. For many reasons (testing, developing outside the corporate network, etc), we have several XML files of fake data, and if a certain configuration flag is set, the files are loaded, a specific file is parsed and used to create a mock response. Each XML file is a set of responses for different situation, all internally consistent. We currently have about 10 XML files, each corresponding to different situation we can run into. This is probably going to grow to 30-50 XML files.
The current system was developed by me during one of those 90-hour-week release cycles, when we were under duress because LiveCycle was down again and we had a deadline to meet. Most of the minor crap has been cleaned up.
The fake data is in an object called FakeData, with properties like customerType1:XML, customerType2:XML, overdueCustomer1:XML, etc. Then in the FakeData constructor, all of the properties are set like this:
customerType1 = FileUtil.loadXML(File.applicationDirectory.resolvePath("fakeData/customerType1.xml"));
And whenever you need some fake data (this happens in special FakeDelegates that extend the real LiveCycle Delegates), you get it from an instance of FakeData.
This is awful, for many reasons, but it works. One embarrassing part is that every time you create an instance of FakeData, it reloads all the XML files.
I'm trying to figure out if there's a design pattern that is not Singleton that can handle this more elegantly. The constraints are:
No global instances can be required (currently, all the code dealing with the fake data, including the fake delegates, is pulled out of production builds without any side-effects, and it needs to stay that way). This puts the Factory pattern out of the running.
It can handle multiple objects using the XML data without performance issues.
The XML files are read centrally so that the other code doesn't have to know where the XML files are, and so some preprocessing can be done (like creating a map of certain tag values and the associated XML file).
Design patterns, or other architecture suggestions, would be greatly appreciated.
Take a look at ASMock, which was developed by a good friend of mine (and a member here, Richard Szalay) and is based on .NET's Rhino Mocks. We've used it in several production environments now, so I can vouch for its stability.
You should be able to get rid of any fake tests (more like integration tests) by using the mock object instead.
Wouldn't it make more sense to do traditional mocking with a mocking framework? Depending on your implementation, it might be possible to set up the Expects by reading the fake-data XML files.
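To show the shape of that idea without tying it to a particular ActionScript framework, here is a sketch using Python's standard unittest.mock. It is an analogy only: FAKE_CUSTOMER_XML, make_fake_delegate, get_customer and customer_name are all hypothetical, and the XML string stands in for the contents of one of the fake-data files.

```python
import xml.etree.ElementTree as ET
from unittest import mock

# Stand-in for the contents of one fake-data XML file.
FAKE_CUSTOMER_XML = "<customer type='1'><name>Test Customer</name></customer>"


def make_fake_delegate(xml_text):
    """Build a mock delegate whose get_customer() returns the parsed fake data."""
    delegate = mock.Mock()
    delegate.get_customer.return_value = ET.fromstring(xml_text)
    return delegate


def customer_name(delegate, customer_id):
    """Code under test: it only sees the delegate, never the XML files."""
    response = delegate.get_customer(customer_id)
    return response.findtext("name")
```

A test would then call customer_name(make_fake_delegate(FAKE_CUSTOMER_XML), "42") and afterwards verify the interaction with delegate.get_customer.assert_called_once_with("42"). The fake data is parsed once, where the mock is set up, rather than in every object that needs it.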
Here is a Google Code project that offers mocking for ActionScript.