How to append a dataframe to an existing Apache Arrow file on disk - pyarrow

Scenario: In my daily ETL process, I'm considering storing my data additionally as Apache Arrow files for the benefit of zero-copy serialization.
If I have an existing Apache Arrow file on disk with the previous data, how can I use pyarrow to append my processed data for the current day(as a dataframe) to the existing arrow file on disk?
I tried using the "a" mode but that didn't work
with pa.OSFile(output, "a") as sink:
with pa.RecordBatchFileWriter(sink, table_df.schema) as writer:
writer.write_table(table_df)
Is this not recommended? Am I violating the design intent of Apache Arrow?
My other approach would be to read the arrow file as a dataframe, append to it with my current data and then write it back again. But I'm wondering if there is a better approach?

Arrow files are immutable once written, you can't append data through existing APIs. Reading and then writing the data back in would be your only choice if all elements need to be in the same file.
The Arrow Stream format could technically support appending data, but you would lose random access to RecordBatch's

Related

How do we name the files that are streamed via firehose?

I'm building an architecture using boto3, and I hope to dump the data in JSON format from API to S3. What blocks in my way right now is first, firehose does NOT support JSON; my workaround right now is not compressing them but it's still different from a JSON file. But I still want to see a better choice to make the files more compatible.
And second, the file names can't be customized. All the data I collected will be eventually converted onto Athena for the query, so can boto3 do the naming?
Answering a couple of the questions you have. Firstly if you stream JSON into Firehose it will write JSON to S3. JSON is the file data structure and compression is the file type. Compressing JSON doesn't make it something else. You'll just need to decompress it before consuming it.
RE: file naming, you shouldn't care about that. Let the system name it whatever. If you define the Athena table with the location, you'll be able to query it. When new files are added, you'll be able to query them immediately.
Here is an AWS tutorial that walks you through this process. JSON stream to S3 with Athena query.

Best data processing software to parse CSV file and make API call per row

I'm looking for ideas for an Open Source ETL or Data Processing software that can monitor a folder for CSV files, then open and parse the CSV.
For each CSV row the software will transform the CSV into a JSON format and make an API call to start a Camunda BPM process, passing the cell data as variables into the process.
Looking for ideas,
Thanks
You can use a Java WatchService or Spring FileSystemWatcher as discussed here with examples:
How to monitor folder/directory in spring?
referencing also:
https://www.baeldung.com/java-nio2-watchservice
Once you have picked up the CSV you can use my example here as inspiration or extend it: https://github.com/rob2universe/csv-process-starter specifically
https://github.com/rob2universe/csv-process-starter/blob/main/src/main/java/com/camunda/example/service/CsvConverter.java#L48
The example starts a configurable process for every row in the CSV and includes the content of the row as a JSON process data.
I wanted to limit the dependencies of this example. The CSV parsing logic applied is very simple. Commas in the file may break the example, special characters may not be handled correctly. A more robust implementation could replace the simple Java String .split(",") with an existing CSV parser library such as Open CSV
The file watcher would actually be a nice extension to the example. I may add it when I get around to it, but would also accept a pull request in case you fork my project.

ADF Merge-Copying JSON files in Copy Data Activity creates error for Mapping Data Flow

I am trying to do some optimization in ADF. Setup is a third-party tool copies one JSON file per object to a BLOB storage container. These feed to a Mapping Data Flow. The individual files written by the third party tool work great. If I copy these files to a different BLOB folder using an Azure Copy Data activity, the MDF can no longer parse the files and gives an error: "JSON parsing error, unsupported encoding or multiline." I started this with a Merge Files, but outcome is same regardless of copy behavior I choose.
2ND EDIT: After another day's work, I have found that the Copy Activity Merge File from JSON to JSON definitely adds an EOL character to each single JSON object as it gets imported to the Merge file. I have also found that the MDF fails definitely with those EOL characters in the Merge file. If I remove all EOL characters from the Merge file, the same MDF will work. For me, this is a bug. The copy activity is adding a character that breaks the MDF. There seems to be a second issue in some of my data that doesn't fail as an individual file but does when concatenated that breaks the MDF when I try to pull all the files together, but I have tested the basic behavior on 1-5000 files and been able to repeat the fail/success tests.
I took the original file, and the copied file, ran them through all of sorts of test, what I eventually found when I dump into Notepad++:
Copied file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\r\n
Original file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\n
If I change the copied file from ending with \r\n to \n, the MDF can read the file again. What is going on here? And how do I change the file write behavior or the MDF settings so that I can concatenate or copy files without the CRLF?
EDIT: NEW INFORMATION -- It seems on further review like maybe the minification/whitespace removal is the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF masked something else. I'm not sure what to do at this point, but its super frustrating.
Other possibly relevant information:
Both the source and sink JSON datasets are set to use UTF-8 (not default(UTF-8), although I tried that). Would a different encoding fix this?
I have tried remapping schemas, creating new data sets, creating new Mapping Data Flows, still get the same error.
EDITED for clarity based on comments:
In the case of a single JSON element in a file, I can get this to work -- data preview returns same success or failure as pipeline when run
In the case of multiple documents merged by ADF I get the below instead. It seems on further review like maybe the minification/whitespace removal is the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF masked something else. I'm not sure what to do at this point, but its super frustrating.
Repro: Create any valid JSON as a single file, put it in blob storage, use it as a source in a mapping data flow, to do any sink operation. Create a second file with same schema, get them both to run in same flow using wildcard paths. Use a Copy Activity with Merge Files as the Sink Copy Activity and Array of Objects as the File pattern. Try to make your MDF use this new file. If it fails, download the file created by ADF, run it through a formatter (I have used both VS Code -> "Format Document" from standard VS Code JSON extension, and VS 2019 "Unminify" command) and reupload... It should work now.
don't know if you already solved the problem: I came across the exact same problem 3 days ago and after several tries I found a solution:
in the copy data activity under sink settings, use "set of objects" (instead of "array of objects") under File Pattern, so that the merged big JSON has the value of the original small JSON files written per line
in the MDF after setting up the wildcard paths with the *.json pattern, under JSON Settings select: Document per line as the Document form.
After that you should be good to go, as least it solved my problem. The automatic written CRLF in "array of objects" setting in the copy data activity should be a default setting and MSFT should provide the option to omit it in the settings in the future.
According to my test:
1.copy data activity can't change unix(LF) to windows(CRLF).
2.MDF can also parse unix(LF) file and windows(CRLF) file.
Maybe there is something else wrong.
By the way,I see there is a comma after "name":"Customer Name" in your Original file,I delete it before my test.

How to create BigQuery Table from JSONs in Google Cloud Storage if some fields have forbidden characters?

I'm trying to move a bunch of data that I have in a bucket (newline delimited json files) into BigQuery. BigQuery forbids certain characters in their field names, such as dashes - or slashes. Our data unfortunately has dashes in many of the field names, i.e.
jsonPayload.request.x-search
I tried renaming the field in the BigQuery schema to
jsonPayload.request.x_search hoping that the loader would do some magic, but nope.
Aside from running a job to rename the fields in storage (really undesirable, especially because new files come in hourly), is there a way to map fields in the JSON files to fields in the BQ schema?
I've been using the console UI but it makes no difference to me what interface to use with BQ.
I see a few option to work around this:
Create a Cloud Function to trigger when your new files arrive. Inside that function, read the contents of the file, and transform it. Write the results back to a new file and load it into BigQuery. I'm not sure how scalable this is in your situation. If your files are quite big, then this might not work.
Create a Cloud Function to trigger when your new files arrive, and then invoke a Dataflow templated pipeline to ingest, transform, and write the data to BigQuery. This is scalable, but comes with extra costs (Dataflow). However, it's a nice pattern for loading data from GCS into BigQuery.
Lazily, within BigQuery:
Import as CSV
One column per row, pick a delimiter that doesn't occur inside the files
Parse within BigQuery
Either with the BQ JSON functions
Or with javascript UDFs for maximum flexibility
At least this is what I usually do.

AWS Glue Crawler Classifies json file as UNKNOWN

I'm working on an ETL job that will ingest JSON files into a RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1MB in size. If I minify a file (instead of pretty print) it will classify the file without issue if the result is under 1MB.
I'm having trouble coming up with a workaround. I tried converting the JSON to BSON or GZIPing the JSON file but it is still classified as UNKNOWN.
Has anyone else run into this issue? Is there a better way to do this?
I have two json files which are 42mb and 16mb, partitioned on S3 as path:
s3://bucket/stg/year/month/_0.json
s3://bucket/stg/year/month/_1.json
I had the same problem as you, crawler classification as UNKNOWN.
I were able to solved it:
You must create custom classifier with jsonPath as "$[*]" then create new crawler with the classifier.
Run your new crawler with the data on S3 and proper schema will be created.
DO NOT update your current crawler with the classifier as it won't apply the change, I don't know why, maybe because of classifier versioning AWS mentioned in their documents. Create new crawler make them work
As mentioned in
https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json
When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.
That is something which Dung also pointed out in his answer.
Please also note that file encoding can lead to JSON being classified as UNKNOWN. Please try and re-encode the file as UTF-8.