Separate nested dict inside a series column of a dataframe - json

I'm trying to extract the username from the author column of a dataframe for each row. The author column is a Series, and each individual value inside it is a dictionary.
Converting the author column directly to a dataframe, and changing the type of the author column, did not help me reach the goal.
I'm only able to reference the username via
df_item['author'][0]['username']
What I'm after is a separate username column:
id                   type  content        channel_id           username
1047404831062638613  0     It’s really..  1047331843898359849  lips
1047333443165499432  0     okay,thankyou  1047331843898359849  Mj

Did you try this:
df['author_username'] = df['author'].apply(lambda x: x['username'])
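If some rows are missing the 'username' key, or hold NaN instead of a dict, a guarded variant avoids a KeyError or TypeError (a sketch, assuming the column otherwise holds plain dicts):
df['author_username'] = df['author'].apply(
    lambda x: x.get('username') if isinstance(x, dict) else None
)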

Alternatively, first convert the Series to a list, then pass it to the pandas DataFrame constructor, which takes care of the rest:
df_new = pd.DataFrame(list(df["author"]))
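pandas also ships a helper for exactly this shape of data. Here is a sketch that flattens the dicts with json_normalize and joins the result back under prefixed column names (the 'author_' prefix is my own choice, not from the question):
import pandas as pd

# Flatten each dict in 'author' into its own set of columns
expanded = pd.json_normalize(df['author'].tolist())

# Attach the flattened columns back, prefixed to avoid name clashes
# (assumes df has a default RangeIndex so the rows line up)
df = df.join(expanded.add_prefix('author_'))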

Related

Importing multiple 1D JSON arrays in Excel

I'm trying to import a JSON file containing multiple unrelated 1D arrays, each with a variable number of elements, into Excel.
The JSON I wrote is:
{
  "table": [1, 2, 3],
  "table2": ["A", "B", "C"],
  "table3": ["a", "b", "c"]
}
When I import the file using Power Query and expand the columns, it multiplies the previous entries each time I expand a new column.
Is there a way to solve this, so that the elements of each array are shown below each other and each array becomes a new column?
One method would be to transform each Record into a List and then create a table using the Table.FromColumns method.
This needs to be done from the Advanced Editor.
Read the code comments and explore the Applied Steps to better understand; the Help topics for the various functions will also be useful.
let
    //Change following line to reflect your actual data source
    Source = Json.Document(File.Contents("C:\Users\ron\Desktop\New Text Document.txt")),
    //Get Field Names (= table names)
    fieldNames = Record.FieldNames(Source),
    //Create a list of lists whereby each sublist is derived from the original record
    jsonLists = List.Accumulate(fieldNames, {}, (state, current) => state & {Record.Field(Source, current)}),
    //Convert the lists into columns of a new table
    myTable = Table.FromColumns(jsonLists, fieldNames)
in
    myTable
Results: one column per array (table, table2, table3), with the elements of each array listed below each other.
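For comparison, the same reshaping outside Excel as a pandas sketch (the file path is hypothetical; from_dict with orient='index' pads shorter arrays with NaN, much like Table.FromColumns pads shorter lists with null):
import json
import pandas as pd

with open("arrays.json") as f:  # hypothetical path
    data = json.load(f)

# One column per top-level array; unequal lengths are padded with NaN
df = pd.DataFrame.from_dict(data, orient="index").transpose()
print(df)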

Apache NiFi: Creating new column using a condition

I have asked a similar question before, but I wasn't able to find a solution to my problem through that approach. I have a csv which looks like this:
studentID,regger,age,number
123,west,12,076392367
456,nort,77,098123124
231,west,33,076346325
I want to add a new column and fill in its values according to the data in the number field. This is the logic:
If the first 4 digits of the value in the number column equal "0763", then the new column (named status) must be set to INSIDE; for any other value it is OUTSIDE.
Per that logic, the output must look like this:
studentID,regger,age,number,status
123,west,12,076392367,INSIDE
456,nort,77,098123124,OUTSIDE
231,west,33,076346325,INSIDE
My Approach
I tried to achieve this by first duplicating the number column into the status column, and then taking the first 4 digits and working from there.
I hope you can suggest a NiFi workflow to make this possible.
I used the UpdateRecord processor twice and got the results that you want.
Input
I started with your input data.
studentID,regger,age,number
123,west,12,076392367
456,nort,77,098123124
231,west,33,076346325
Process
First, set the UpdateRecord processor as follows:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Replacement Value Strategy: Record Path Value
/status: /number
This will create the new status column populated with the value of the number column.
Second, the first output should go to another UpdateRecord processor with these options:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Replacement Value Strategy: Literal Value
/status: ${field.value:substring(0,4):equals('0763'):ifElse(${field.value:replace(${field.value},'INSIDE')},${field.value:replace(${field.value},'OUTSIDE')})}
and this will give you the final results.
Be aware that the number column must not be read as an integer (the leading zero would be lost), so set the CSVReader record reader's Schema Access Strategy option to Use String Fields From Header.
Output
studentID,regger,age,number,status
123,west,12,076392367,INSIDE
456,nort,77,098123124,OUTSIDE
231,west,33,076346325,INSIDE
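For reference, here is the same status rule outside NiFi, as a small Python sketch (file names are hypothetical; the prefix test matches the substring(0,4) comparison above):
import csv

with open("students.csv", newline="") as src, \
     open("students_with_status.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["status"])
    writer.writeheader()
    for row in reader:
        # Keep the number as a string so the leading zero survives
        row["status"] = "INSIDE" if row["number"].startswith("0763") else "OUTSIDE"
        writer.writerow(row)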
You can try the logic below:
SplitText ->
ExtractText ->
RouteOnAttribute (add a condition checking whether the first four digits are 0763)
-----matched relation--> ReplaceText (extracted attribute from file + "INSIDE") -> PutFile
-----unmatched relation--> ReplaceText (extracted attribute from file + "OUTSIDE") -> PutFile
Hope this will help you.

Dataframe is of type 'nonetype'. How should I alter this to allow merge function to operate?

I have pulled in data from a number of csv files, as well as a database. I wish to use a merge to make a dataframe isolating the phone numbers that are contained in both dataframes (one originating from the csv files, the other from the database). However, the dataframe from the database displays as type 'NoneType', which disallows any operation such as merge. How can I change this to allow the operation?
The data comes in from the database as a list of tuples, which I then convert to a dataframe. However, as stated above, it displays as 'NoneType'. I'm assuming I am confused about how dataframes handle data types.
#Grab Data
mycursor = mydb.cursor()
mycursor.execute("SELECT DISTINCT(Cell) FROM crm_data.ap_clients Order By Cell asc;")
apclients = mycursor.fetchall()

#Clean phone number data: keep the last 10 digits
for index, row in data.iterrows():
    data['phone_number'][index] = data['phone_number'][index][-10:]
for index, row in data2.iterrows():
    data2['phone_number'][index] = data2['phone_number'][index][-10:]
for index, row in data3.iterrows():
    data3['phone_number'][index] = data3['phone_number'][index][-10:]

#Make data frame from csv files
fbl = pd.concat([data, data2, data3], axis=0, sort=False)

#Make data frame from apclients (database extraction)
apc = pd.DataFrame(apclients)

#Perform merge finding all records in both frames
successfulleads = pd.merge(fbl, apc, left_on='phone_number', right_on='0')

#type(apc) returns NoneType
The expected results are to find all records in both dataframes, along with a count so that I may compare the two sets. Any help is greatly appreciated from this great community :)
So it looks like I had a function to rename the column of the dataframe as shown below:
apc = apc.rename(columns={'0': 'phone_number'}, inplace=True)
for col in apc.columns:
    print(col)
The part of the snippet above responsible:
inplace=True
This flag dictates whether the object is modified in place or a copy is returned. With inplace=True, the method modifies the dataframe and returns None, so assigning the result back to apc replaces the dataframe with None.
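A minimal fix (a sketch): either keep the assignment or keep inplace=True, but not both. Note also that a dataframe built from fetchall() tuples typically gets the integer column label 0 rather than the string '0':
# Option 1: modify in place, no reassignment
apc.rename(columns={0: 'phone_number'}, inplace=True)

# Option 2: keep the assignment, drop inplace
apc = apc.rename(columns={0: 'phone_number'})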
Hope this helps whoever ends up in my position. A great thanks again to the community. :)

Extract comma-separated values from JSON Records within a List with PowerQuery

As part of a tool I am creating for my team I am connecting to an internal web service via PowerQuery.
The web service returns nested JSON, and I have trouble parsing the JSON data into the format I am looking for. Specifically, I have a problem with extracting the content of records in a column to a comma-separated list.
The data
The data (shown as a screenshot in the original post) contains details related to a specific "race" (race_id). What I want to focus on is the information in driver_codes, which is a List of Records. The number of records varies from 0 to 4, and each record is structured as id: 50000 (where 50000 could be any 5-digit number). So it could be:
id: 10000
id: 20000
id: 30000
As requested, an example snippet of the raw JSON:
<race>
  <race_id>ABC123445</race_id>
  <begin_time>2018-03-23T00:00:00Z</begin_time>
  <vehicle_id>gokart_11</vehicle_id>
  <driver_code>
    <id>90200</id>
  </driver_code>
  <driver_code>
    <id>90500</id>
  </driver_code>
</race>
I want it to be structured as:
10000,20000,30000
The problem
When I choose "Extract values" on the column with the list, I get the following message:
Expression.Error: We cannot convert a value of type Record to type Text.
If I instead choose "Expand to new rows", then duplicate rows are created for each unique driver code. I now have several rows per unique race_id, but what I wanted was one row per unique race_id and a concatenated list of driver codes.
What I have tried
I have tried grouping the data by the race_id, but the operations allowed when grouping data do not include concatenating rows.
I have also tried unpivoting the column, but that leaves me with the same problem: I still get multiple rows.
I have googled (and Stack Overflowed) this issue extensively without luck. It might be that I am using the wrong keywords, however, so I apologize if a duplicate exists.
UPDATE: What I have tried based on the answers so far
I tried Alexis Olson's excellent and very detailed method, but I end up with the following error:
Expression.Error: We cannot convert the value "id" to type Number.
Details:
    Value=id
    Type=Type
The error comes from using either of these lines of M code (one with a List.Transform and one without):
= Table.Group(#"Renamed Columns", {"race_id", "begin_time", "vehicle_id"},
{{"DriverCodes", each Text.Combine([driver_code][id], ","), type text}})
= Table.Group(#"Renamed Columns", {"race_id", "begin_time", "vehicle_id"},
{{"DriverCodes", each Text.Combine(List.Transform([driver_code][id], each Number.ToText(_)), ","), type text}})
NB: if I do not write [driver_code][id] but only [id] then I get another error saying that column [id] does not exist.
Here's the JSON equivalent to the XML example you gave:
{"race": {
"race_id": "ABC123445",
"begin_time": "2018-03-23T00:00:00Z",
"vehicle_id": "gokart_11",
"driver_code": [
{ "id": "90200" },
{ "id": "90500" }
]}}
If you load this into the query editor, convert it to a table, and expand out the Value record, you'll have a table with one row per race and the driver_code records still packed into a single column.
At this point, choose Expand to New Rows, and then expand the id column so that your table has one row per driver code id.
At this point, you can apply the trick #mccard suggested. Group by the first columns and aggregate over the last using, say, max.
This last step produces M code like this:
= Table.Group(#"Expanded driver_code1",
{"Name", "race_id", "begin_time", "vehicle_id"},
{{"id", each List.Max([id]), type text}})
Instead of this, you want to replace List.Max with Text.Combine as follows:
= Table.Group(#"Changed Type",
{"Name", "race_id", "begin_time", "vehicle_id"},
{{"id", each Text.Combine([id], ","), type text}})
Note that if your id column is not in the text format, then this will throw an error. To fix this, insert a step before you group rows using Transform tab > Data Type: Text to convert the type. Another option is to use List.Transform inside your Text.Combine like this:
Text.Combine(List.Transform([id], each Number.ToText(_)), ",")
Either way, you should end up with one row per unique race_id and a comma-separated list of driver codes.
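For comparison, the same group-and-concatenate step as a pandas sketch (data taken from the example above; purely illustrative):
import pandas as pd

df = pd.DataFrame({
    "race_id": ["ABC123445", "ABC123445"],
    "id": ["90200", "90500"],
})

# Group by race and join the ids into one comma-separated string,
# mirroring Table.Group + Text.Combine
out = df.groupby("race_id", as_index=False)["id"].agg(",".join)
print(out)  # one row: ABC123445, "90200,90500"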
An approach would be to use the Advanced Editor and change the operation done when grouping the data directly there in the code.
First, create the grouping using one of the operations available in the menu. For instance, create a "Sum" column using the Sum operation. It will give an error, but it gives us the starting code to work on.
Then, open the Advanced Editor and find the code corresponding to the operation. It should be something like:
{{"Sum", each List.Sum([driver_codes]), type text}}
Change it to:
{{"driver_codes", each Text.Combine([driver_codes], ","), type text}}

reading csv columns dynamically in Pentaho Kettle

If I have a Table Input step with a query such as Select * from myTable
and it goes to a User Defined Java Class step, the following code allows me to grab the column names dynamically from the table.
RowMetaInterface rowMetaInterface = getInputRowMeta();
List myList = rowMetaInterface.getValueMetaList();
String colName;
for (int i = 0; i < myList.size(); i++) {
    colName = ((ValueMetaInterface) myList.get(i)).getName();
}
However, this code doesn't work if the first step is a CSV input step. I have a variable for the CSV filename, so I can't do a 'Get Fields' to pull the columns. Is there a way I can read the csv column names dynamically?
Not a solution, but some interesting hints:
http://diethardsteiner.github.io/pdi/2015/10/31/Transformation-Executor-Record-Groups.html
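For comparison outside Kettle: reading header names dynamically from a CSV file is a one-liner in Python (a sketch; the path is hypothetical):
import csv

with open("myfile.csv", newline="") as f:
    col_names = next(csv.reader(f))  # first row = header
print(col_names)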