Trying to pull the Name and/or ID of the code below, but can only pull the Job-Base-Cost - html

Below is the code I have now. It pulls the Job-Base-Cost just fine, however I cannot get it to pull the ID and or Name of the item. Can you help?
Link to the sites XML pull.
=importxml("link","//job-base-cost")

This is a sample of one line of the OP's XML file
<job-base-cost id="24693" name="Abaddon Blueprint">109555912.69</job-base-cost>
The OP wants to use the IMPORTXML function to report the ID and Name as well as the Job Cost from the XML data. Presently, the OP's formula is:
=importxml("link","//job-base-cost")
There are two options:
1 - One long column
=importxml("link","//#id | //#name | //job-base-cost")
Note //#id and //#name in the xpath query: // indicate nodes in the document (at any level, not just the root level) and # indicate attributes. The pipe | operator indicates AND. So the plain english query is to display the id, name and job-base-cost.
2 - Three columns (table format)
={IMPORTXML("link","//#name"),IMPORTXML("link","//job-base-cost"),IMPORTXML("link","//#id")}
This creates a series that will display the fields in each of three columns.
Note: there is an arrayformula that uses a single importXML function described in How do I return multiple columns of data using ImportXML in Google Spreadsheets?. Readers may want to look at whether that option can be implemented.
My thanks to #Tanaike for his comment which spurred me to look at how xpath works.

Related

Generating Date Range in Google Sheets

I have a Google Sheet that contains extracted metadata from a large amount of files that I transferred from CDs to a server. I am currently working on creating a description for these materials to include in a finding aid. I found it easiest to work in Excel or Sheets because the PUI we use to output our finding aids utilizes a spreadsheet upload plugin.
I've been using pivot tables in Google Sheets to sort through all of the data with little issue except when I need to generate a date range for the files contained in one CD. Essentially, I'm creating a pivot table that contains rows for the URI, Filename (in this case I'm filtering for folder name only), and date_modified. The data looks something like this:
URI
FILENAME
DATE_MODIFIED
URI1
FOLDER1
2000-01-01
URI1
FOLDER2
2000-01-01
URI1
FOLDER3
2000-02-01
URI1
FOLDER4
1999-12-02
URI2
FOLDER1
2001-01-20
... and so on.
I'd like to generate a date range for each unique URI. I realize I could just go through each unique URI and manually extract that data but I have a LOT of these to go through so I don't think it is the most efficient use of my time. Especially, when you notice that the dates do not follow a chronological order. I'm thinking that pivot tables are not going to help me here so if anyone has other suggestions I'm happy to listen but brownie points if anyone has a solution that works in Sheets!
Try this on a new tab somewhere.
=QUERY(Sheet1!A:C,"select A,MIN(C),MAX(C) where A<>'' group by A")
change the range ref to suit.
Then in the next column over, depending on where you output the query,
=IF(A2="",,TEXT(B2,"yyyy-mm-dd")&"-"&TEXT(C2,"yyyy-mm-dd"))
drag down to the bottom.

Apache NiFi: Mapping an external file to get a new column

I was following the answer in this stack overflow question. But I couldn't get the output I am expecting to get. I have a sample csv, which looks like this:
id,name,city
12,Jimmy,Ontario
33,Kimmel,York
Every city has a unique code, which I have stored in another csv.This is how my csv to be used to map looks like. (I have separated the two values in the row using a tab in the real txt file)
California 5435
Ontario 2342
York 3456
The final output must be like the following:
id,name,city,code
12,Jimmy,Ontario,2342
33,Kimmel,York,3456
This csv has much more data therefore the replacement cannot be achieved by using ReplaceText Processor. So it can be done only by using the ReplaceTextwithMapping Processor.
I have followed the exact steps used as an answer in this question. But it seems like the Replace TextwithMapping is not working as expected. Since a new column is made successfully. But it's just contains the same content of the city column not the codes I want.
Its greatly appreciated if you could submit an answer that I can follow to finally succeed in getting the desired output using ReplaceTextwithMapping Processor
I could make it work kind of using LookupRecord. The overall flow looks like this:
GenerateFlowFile:
LookupRecord:
Configure CSVReader and CSVRecordSetWriter to treat the first line as header line. For the other properties leave the defaults. Configure CSVRecordLookupService:
The mapping file has following content:
city,code
Ontario,2342
California,5435
The LookupRecord processor will take the city value and lookup the proper record from the mapping file. It will extract the code and add the field to the current record. Result:
As you can see the code has been added to the CSV file. However it is enclosed in an object, although I have selected Insert Record Fields setting. I could not solve this problem. This could be another question to ask explicitly.

Reshape the dataset into more relational format (Transpose SOME rows and assign them to a data subset)

I have a spreadsheet/csv:
Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
...
I need to reshape it to the following view:
Code:,Course Description:,Students,Lecture Hours,Labs Hours,Average Test Score,Teaching Staff
101,"Introduction to Rocket Science",John Galt,48,120,4.7,Passed
101,"Introduction to Rocket Science",James Taggart,50,120,4.9,Passed
...
Beleive it or not, can not get the right idea how to do that despite it seems to be very primitive transformation, is there any silver bullet for this?
Original records (csv) have in a way json-like structure so my first approach was to represent the original data as a vector and then transpose it, (but in this case my resulting table looks like sparced matrix - rows I have transpored are blank in the rest of its values)
Another way Im considering - **serialize it into jsons and then de-serialize** into new spreadsheet (jsonize()) - in this case, Im having problems with merging them properly.
In both ways I have it "half-working";
Can anyone suggest simple and reliable algorithm for this;
Any language, RegEx, any tools, code snippets are very appreciated
Assuming that the pattern you've described here is consistent throughout, there are quite a few different approaches you could take I think, but in all cases you basically can use that fact that the 'Course' rows start with "Code:" but that's never going to be a student name.
You can take advantage of this either by a regular expression find/replace, or within OpenRefine.
Example:
Open file in a text editor that supports regular expressions in
find/replace
Search for lines starting with 'Code:' and add additional commas to the start of the row to shift the course data columns to the
right e.g. search for: ^Code: replace with: ,,,,,^Code:
If you now import the file into OpenRefine then you'll have a project with 10 columns (the 10th col is caused by the trailing
comma at the end of the course data row)
You can now use Transpose (or just rename) on the right-most columns which contain the course data, while leaving the left-most
columns which contain the student details
Isolate the rows that contain the phrase 'Student Name' in the first column and remove them (via a filter or facet)
Move the Course Code/Description columns to the beginning of the project, and use the 'Edit Cells->Fill Down' option on each column to get the values repeated on all the relevant lines
Finally rename the columns as you want, remove any extraneous columns

NiFi : Regular Expression in ExtractText gets CSV header instead of data

I'm working on a flow where I get CSV files. I want to put the records into different directories based on the first field in the CSV record.
For ex, the CSV file would look like this
country,firstname,lastname,ssn,mob_num
US,xxxx,xxxxx,xxxxx,xxxx
UK,xxxx,xxxxx,xxxxx,xxxx
US,xxxx,xxxxx,xxxxx,xxxx
JP,xxxx,xxxxx,xxxxx,xxxx
JP,xxxx,xxxxx,xxxxx,xxxx
I want to get the field value of the first field i.e, country. Put those records into a particular directory. US records goes to US directory, UK records goes to UK directory, and so on.
The flow that I have right now is:
GetFile ----> SplitText(line split count = 1 & header line count = 1) ----> ExtractText (line = (.+)) ----> PutFile(Directory = \tmp\data\${line:getDelimitedField(1)}). I need the header file to be replicated across all the split files for a different purpose. So I need them.
The thing is, the incoming CSV file gets split into multiple flow files with the header successfully. However, the regex that I have given in ExtractText processor evaluates it against the splitted flow files' CSV header instead of the record. So instead of getting US or UK in the "line" attribute, I always get "country". So all the files go to \tmp\data\country. Help me how to resolve this.
I believe getDelimitedField will only work off a singular line and is likely not moving past the newline in your split file.
I would advocate for a slightly different approach in which you could alter your ExtractText to find the country code through a regular expression and avoid the need to include the contents of the file as an attribute.
Using a regex of ^.*\n+(\w+) will capture the first line and the first set of word characters up to the comma and place them in the attribute name you specify in capture group 1. (e.g. country.1).
I have created a template that should get the value you are looking for available at https://github.com/apiri/nifi-review-collateral/blob/master/stackoverflow/42022249/Extract_Country_From_Splits.xml

Best way to parse a big and intricated Json file with OpenRefine (or R)

I know how to parse json cells in Open refine, but this one is too tricky for me.
I've used an API to extract the calendar of 4730 AirBNB's rooms, identified by their IDs.
Here is an example of one Json file : https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until november 2017, i would like to extract the availability of this rooms (true or false) and its price at this day.
I can't figure out how to parse out these informations. I guess that it implies a series of nested forEach, but i can't find the right way to do this with Open Refine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionnaries that disrupts me.
Any help would be appreciate. If the operation is too difficult in Open Refine, a solution with R (or Python) would also be fine for me.
Rather than just creating your Project as text, and working with GREL to parse out...
The best way is just select the JSON record part that you want to work with using our visual importer wizard for JSON files and XML files (you can even use a URL pointing to a JSON file as in your example). (A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw )
Select the JSON part that contains your records that you want to parse and work with (this can be any repeating part, just select one of them and OpenRefine will extract all the rest)
Limit the amount of data rows that you want to load in during creation, or leave default of all rows.
Click Create Project and now your in Rows mode. However if you think that Records mode might be better suited for context, just import the project again as JSON and then select the next outside area of the content, perhaps a larger array that contains a key field, etc. In the example, the key field would probably be the Date, and why I highlight the whole record for a given date. This way OpenRefine will have Keys for each record and Records mode lets you work with them better than Row mode.
Feel free to take this example and make it better and even more helpful for all , add it to our Wiki section on How to Use
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OR array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from example URL above - this gives me 441 rows for the single ID - each contains the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.