Best way to consolidate fragments within a pyarrow dataset? - pyarrow

Is there a better way to accomplish my goals than the code below?
I want to read data based on filters from Dataset A, which contains a lot of small fragments (there are many files in this dataset because I download data frequently).
I want to consolidate the fragments based on the partition, in a loop (I use a loop when I cannot fit all of the filters into memory, so I process them one by one).
I want to write this data to a new dataset (Dataset B) as a consolidated file which gets read by our BI tool. Unfortunately there is no partition_filename_cb function in the new writer, so I need to use the legacy write_to_dataset for this; the file is generally named after the partition.
I would really like to clean up Dataset A. Over time, more and more files get added to the partitions since I am downloading data frequently and rows can be updated (some of these fragment files contain only one or two records).
Below is my current process.
I use a ds.Scanner to apply my filters and select my columns from an original dataset
def retrieve_fragments(dataset, filter_expression, columns):
    """Creates a dictionary of file fragments and their partition keys from a pyarrow dataset"""
    fragment_partitions = {}
    scanner = ds.Scanner.from_dataset(dataset, columns=columns, filter=filter_expression)
    fragments = scanner.get_fragments()
    for frag in fragments:
        keys = ds._get_partition_keys(frag.partition_expression)
        fragment_partitions[frag] = keys
    return fragment_partitions
Below I create small lists of all of the fragments that share the same filter expression. I can then write these to a new dataset as a consolidated file, and I assume that I can also delete the individual fragment files and write a new consolidated version in their place?
fragments = retrieve_fragments(
    dataset=dataset, filter_expression=filter_expression, columns=read_columns
)

unique_filters = []
dfs = []
for fragment, filter_value in fragments.items():
    if filter_value not in unique_filters:
        unique_filters.append(filter_value)

# each chunk is a list of all of the fragments with the same partition_expression / filter,
# which we turn into a new dataset that we can then process or resave into a consolidated file
for unique_filter in unique_filters:
    chunks = []
    for frag, filter_value in fragments.items():
        if filter_value == unique_filter:
            chunks.append(frag.path)

    logging.info(
        f"Combining {len(chunks)} fragments with filter {unique_filter} into a single table"
    )

    table = ds.dataset(chunks, partitioning=partitioning, filesystem=filesystem).to_table(
        columns=read_columns
    )

    # ignoring metadata due to some issues with columns having a boolean type
    # even though they were never boolean
    df = table.to_pandas(ignore_metadata=True)

    # this function would just sort and drop duplicates on a unique constraint key
    df = prepare_dataframe(df)

    table = pa.Table.from_pandas(df=df, schema=dataset.schema, preserve_index=False)

    # write dataset to Dataset B (using partition_filename_cb)

    # I believe I could now also write the dataset back to Dataset A in a consolidated
    # parquet file and then delete all of the fragment.paths. This would leave me with
    # only a single file in the partition "folder"
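In case it is useful, here is a rough sketch of that write step. This is hedged: table, filesystem, and chunks come from the loop above, "dev/interactions-raw" is just a stand-in for Dataset A's root path, the filename callback is only an example, and partition_filename_cb is only honoured by the legacy writer, so depending on the pyarrow version use_legacy_dataset=True may be required.

import pyarrow.parquet as pq

# Write the consolidated table to Dataset B, naming the file after the partition value
pq.write_to_dataset(
    table,
    root_path="dev/interactions-final",
    partition_cols=["created_date"],
    partition_filename_cb=lambda keys: f"{keys[0]}.parquet",  # e.g. 2019-11-13.parquet
    filesystem=filesystem,
    use_legacy_dataset=True,
)

# Optionally write the consolidated table back to Dataset A as well,
# then remove the small fragment files that were just combined
pq.write_to_dataset(
    table,
    root_path="dev/interactions-raw",  # hypothetical root of Dataset A
    partition_cols=["created_date"],
    partition_filename_cb=lambda keys: f"{keys[0]}.parquet",
    filesystem=filesystem,
    use_legacy_dataset=True,
)
for path in chunks:
    filesystem.delete_file(path)  # assumes filesystem is a pyarrow.fs.FileSystem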
The output of this process saves a single file per partition into the new dataset (e.g. /dev/interactions-final/created_date=2019-11-13/2019-11-13.parquet):
INFO - Combining 78 fragments with filter {'created_date': datetime.date(2019, 11, 13)} into a single table
INFO - Saving 172657 rows and 36 columns (70.36 MB to dev/interactions-final)
INFO - Combining 57 fragments with filter {'created_date': datetime.date(2019, 11, 18)} into a single table
INFO - Saving 67036 rows and 36 columns (29.63 MB to dev/interactions-final)
INFO - Combining 55 fragments with filter {'created_date': datetime.date(2019, 11, 19)} into a single table
INFO - Saving 65035 rows and 36 columns (29.62 MB to dev/interactions-final)
INFO - Combining 63 fragments with filter {'created_date': datetime.date(2019, 11, 20)} into a single table
INFO - Saving 63613 rows and 36 columns (30.76 MB to dev/interactions-final)

Have you tried write_dataset (code here)? It will repartition and I think it collects small fragments in the process.
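For example, something along these lines (an untested sketch: "dev/interactions-raw" stands in for Dataset A's root, partitioning and filesystem are the same objects as in the code above, and existing_data_behavior needs a reasonably recent pyarrow version):

import pyarrow.dataset as ds

# Point at the fragmented dataset and rewrite it; write_dataset scans all of the
# small fragments and writes each partition back out as a small number of larger files.
dataset = ds.dataset("dev/interactions-raw", partitioning=partitioning, filesystem=filesystem)

ds.write_dataset(
    dataset,
    base_dir="dev/interactions-final",
    format="parquet",
    partitioning=partitioning,
    basename_template="part-{i}.parquet",
    existing_data_behavior="delete_matching",  # replace whatever is already in each partition folder
    filesystem=filesystem,
)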

Related

Importing multiple 1D JSON arrays in Excel

I'm trying to import a JSON file containing multiple unrelated 1D arrays with a variable number of elements into Excel.
The JSON I wrote is:
{
  "table": [1,2,3],
  "table2": ["A","B","C"],
  "table3": ["a","b","c"]
}
When I import the file using Power Query and expand the columns, it multiplies the previous entries each time I expand a new column.
Is there a way to solve this, showing the elements of each array below each other and each array as a new column?
One method would be to transform each Record into a List and then create a table using the Table.FromColumns method.
This needs to be done from the Advanced Editor.
Read the code comments and explore the Applied Steps to better understand it. The Help topics for the various functions will also be useful.
let
    //Change following line to reflect your actual data source
    Source = Json.Document(File.Contents("C:\Users\ron\Desktop\New Text Document.txt")),

    //Get Field Names (= table names)
    fieldNames = Record.FieldNames(Source),

    //Create a list of lists whereby each sublist is derived from the original record
    jsonLists = List.Accumulate(fieldNames, {}, (state, current) => state & {Record.Field(Source, current)}),

    //Convert the lists into columns of a new table
    myTable = Table.FromColumns(
        jsonLists,
        fieldNames
    )
in
    myTable
Results

Dataframe is of type 'nonetype'. How should I alter this to allow merge function to operate?

I have pulled in data from a number of csv files, as well as a database. I wish to use a merge function to make a dataframe isolating the phone numbers that are contained in both dataframes (one originating from the csv files, the other originating from the database). However, the dataframe from the database displays as type 'NoneType', which disallows any operation such as merge. How can I change this to allow the operation?
The data comes in from the database as a list of tuples. I then convert this to a dataframe, but, as stated above, it displays as 'NoneType'. I'm assuming I am confused about how dataframes handle data types.
#Grab Data
mycursor = mydb.cursor()
mycursor.execute("SELECT DISTINCT(Cell) FROM crm_data.ap_clients Order By Cell asc;")
apclients = mycursor.fetchall()

#Clean Phone Number Data
for index, row in data.iterrows():
    data['phone_number'][index] = data['phone_number'][index][-10:]
for index, row in data2.iterrows():
    data2['phone_number'][index] = data2['phone_number'][index][-10:]
for index, row in data3.iterrows():
    data3['phone_number'][index] = data3['phone_number'][index][-10:]

#make data frame from csv files
fbl = pd.concat([data, data2, data3], axis=0, sort=False)

#make data frame from apclients (database extraction)
apc = pd.DataFrame(apclients)

#perform merge finding all records in both frames
successfulleads = pd.merge(fbl, apc, left_on='phone_number', right_on='0')

#type(apc) returns NoneType
The expected results are to find all records in both dataframes, along with a count so that I may compare the two sets. Any help is greatly appreciated from this great community :)
So it looks like I had a function to rename the column of the dataframe as shown below:
apc = apc.rename(columns={'0': 'phone_number'}, inplace=True)

for col in apc.columns:
    print(col)
The part of the code above responsible for the problem is:
inplace=True
This argument dictates whether the DataFrame is modified in place or whether a modified copy is returned. With inplace=True the rename happens in place and the method returns None, so assigning the return value back to apc leaves it as NoneType.
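A minimal sketch of the two ways around it (note that a DataFrame built from a list of tuples labels its columns with integers, so the key is the integer 0 rather than the string '0'):

import pandas as pd

apclients = [('5551234567',), ('5559876543',)]   # example rows, as returned by fetchall()
apc = pd.DataFrame(apclients)                    # columns are labelled 0, 1, ...

# Option 1: drop inplace=True and keep the assignment
apc = apc.rename(columns={0: 'phone_number'})

# Option 2: keep inplace=True but do not assign the (None) return value back
# apc.rename(columns={0: 'phone_number'}, inplace=True)

print(apc.columns)   # Index(['phone_number'], dtype='object')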
Hope this helps whoever ends up in my position. A great thanks again to the community. :)

Dynamically merge two CSV files using Dataweave in Mule

I get CSV files of different lengths from different sources. The columns within the CSVs are different, with the only exception being that each CSV file will always have an Id column which can be used to tie the records in the different CSV files together. Two such CSV files need to be processed at a time. The process is to take the Id column from the first file, match the rows within the second CSV file, and create a third file which contains the contents from the first and second files. The Id column can be repeated in the first file. An example is given below. Please note that in the first file I might have 18 to 19 combinations of different data columns, so I cannot hardcode the transformation within DataWeave, and there is a chance that a new file will be added at any time as well. A dynamic approach is what I want to accomplish, so that once written, the logic works even if a new file is added. These files get pretty big as well.
The sample files are given below.
CSV1.csv
--------
id,col1,col2,col3,col4
1,dat1,data2,data3,data4
2,data5,data6,data6,data6
2,data9,data10,data11,data12
2,data13,data14,data15,data16
3,data17,data18,data19,data20
3,data21,data22,data23,data24
CSV2.csv
--------
id,obectId,resid,remarks
1,obj1,res1,rem1
2,obj2,res2,rem2
3,obj3,res3,rem3
Expected file output -CSV3.csv
---------------------
id,col1,col2,col3,col4,objectid,resid,remarks
1,dat1,data2,data3,data4,obj1,res1,rem1
2,data5,data6,data6,data6,obj2,res2,rem2
2,data9,data10,data11,data12,obj2,res2,rem2
2,data13,data14,data15,data16,obj2,res2,rem2
3,data17,data18,data19,data20,obj3,res3,rem3
3,data21,data22,data23,data24,obj3,res3,rem3
I was thinking of using pluck to get the column values from the first file. The idea was to get the columns in the transformation without hardcoding them, but I am getting some errors. After this I have the task of searching for the id and getting the values from the second file.
{(
    using (keys = payload pluck $$)
    (
        payload map ((value, index) ->
            {
                (keys[index]): value
            }
        )
    )
)}
I am getting the following error when using pluck
Type mismatch for 'pluck' operator
found :array, :function
required :object, :function
I am thinking of using groupBy on id for the second file to facilitate better searching, but I need suggestions on how to append the contents in one transformation to form the third file.
Since you want to combine both CSVs without renaming the column names, you can try something like below
var file2Grouped=file2 groupBy ((item) -> item.id)
---
file1 map ((item) -> item ++ ((file2Grouped[item.id])[0] default {}) - 'id')
output
id,col1,col2,col3,col4,obectId,resid,remarks
1,dat1,data2,data3,data4,obj1,res1,rem1
2,data5,data6,data6,data6,obj2,res2,rem2
2,data9,data10,data11,data12,obj2,res2,rem2
2,data13,data14,data15,data16,obj2,res2,rem2
3,data17,data18,data19,data20,obj3,res3,rem3
3,data21,data22,data23,data24,obj3,res3,rem3
The working expression is given below. Removing the id should happen before applying the default:
var file2Grouped=file2 groupBy ((item) -> item.id)
---
file1 map ((item) -> item ++ ((file2Grouped[item.id])[0] - 'id' default {}))

Create a node for each column only once while importing csv into Neo4j

I have a csv file that looks the following way:
I want to create a database from it in Neo4j. Rows are nodes with the label Gene, and columns are also nodes, with the label Cell. I need to write a CREATE query that creates all my Gene and Cell nodes and one relationship for each combination of gene and cell. Currently I am stuck with the following code:
LOAD CSV WITH HEADERS FROM 'file:///merged_full.csv' AS line
CREATE (:Gene {id: line.gene_ids, name: line.wikigene_name})
I need to somehow iterate over all columns - starting from index 3 - after creating gene nodes, but I do not know how to do that.
Here are 3 queries that, performed in order, should do what you want.
This query creates a temporary Headers node with a names property that contains the collection of headers from the CSV file. It uses LIMIT 1 to only process the first row of the file. It also creates all the Cell nodes, each with its own name property.
LOAD CSV FROM 'file:///merged_full.csv' AS line
MERGE (h:Headers)
SET h.names = line
WITH line
LIMIT 1
UNWIND line[3..] AS name
MERGE (c:Cell {name: name})
This query uses the APOC function apoc.map.fromNodes to generate a map named cells, which maps each cell name to its cell node. It also gets the Headers node. It then loads the non-header data from the CSV file (using SKIP 1 to skip over the header row), and processes each row as follows. It uses MERGE to get/create a Gene node, g, with the desired id and name. It uses the REDUCE function to generate a collection of the Cell nodes that have a "1" column value in the current row, and the FOREACH clause then creates a (g)-[:HAS]->(x) relationship (if necessary) for every cell, x, in that collection.
WITH apoc.map.fromNodes('Cell', 'name') AS cells
MATCH (h:Headers)
LOAD CSV FROM 'file:///merged_full.csv' AS line
WITH h, cells, line
SKIP 1
MERGE (g:Gene {id: line[1], name: line[2]})
FOREACH(
x IN REDUCE(s = [], i IN RANGE(3, SIZE(line)-1) |
CASE line[i] WHEN "1" THEN s + cells[h.names[i]] ELSE s END) |
MERGE (g)-[:HAS]->(x))
This query just deletes the temporary Headers node (if you wish):
MATCH (h:Headers)
DELETE h;
If the columns correspond with cell nodes, then you should know all the cell nodes you need just by looking at the CSV header.
I'd recommend writing a small query just to create each of the cell nodes you need, then create an index or unique constraint on :Cell(id) (or name, or whatever the property is that is meant to identify a :Cell).
At that point the problem becomes getting and processing each relevant column (I assume only the ones with 1 as the value). APOC Procedures may help here.
apoc.map.sortedProperties() can be used to take your line map and give you a list of key/value list pairs, which you can filter down to those where the key begins with 'V', and where the value is 1, then use what's remaining to match on the relevant :Cell node and create the relationship.

SSIS - Process a flat file with varying data

I have to process a flat file whose syntax is as follows, one record per line.
<header>|<datagroup_1>|...|<datagroup_n>|[CR][LF]
The header has a fixed-length field format that never changes (ID, timestamp, etc.). However, there are different types of data groups and, even though each is fixed-length, the number of their fields varies depending on the data group type. The first three numbers of a data group define its type. The number of data groups in each record also varies.
My idea is to have a staging table into which I would insert all the data groups. So two records like these,
12320160101|12323456KKSD3467|456SSGFED43520160101173802|
98720160102|456GGLWSD45960160108854802|
Would produce three records in the staging table.
ID   Timestamp    Data
123  01/01/2016   12323456KKSD3467
123  01/01/2016   456SSGFED43520160101173802
987  02/01/2016   456GGLWSD45960160108854802
This would allow me to preprocess the staged records for further processing (some would be discarded, some have their data broken down further). My question is how to break down the flat file into the staging table. I can split the entire record with pipe (|) and then use a Derived Column Transformation to break down the header with SUBSTRING. After that it gets trickier because of the varying number of data groups.
The solution I came up with myself doesn't try to split at the flat file source, but rather in a script. My Data Flow looks like this.
So the Flat File Source output is just a single column containing the entire line. The Script Component contains output columns for each column in the Staging table. The script looks like this.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    var splits = Row.Line.Split('|');

    for (int i = 1; i < splits.Length; i++)
    {
        Output0Buffer.AddRow();
        Output0Buffer.ID = splits[0].Substring(0, 11);
        Output0Buffer.Time = DateTime.ParseExact(splits[0].Substring(14, 14), "yyyyMMddHHmmssFFF", CultureInfo.InvariantCulture);
        Output0Buffer.Datagroup = splits[i];
    }
}
Note that the SynchronousInputID property (Script Transformation Editor > Inputs and Outputs > Output0) must be set to None; otherwise you won't have Output0Buffer available in your script. Finally, the OLE DB Destination just maps the script output columns to the staging table columns. This solves the problem I had with creating multiple output records from a single input record.