I have data as below:
BinaryName License_details
-------------------------------------------------------
usr/sbin/if_up BSD-2C:63, BSD-2C:1576, BSD-3C:4, BSD-3C:214,
BSD-3C:241, BSD-3C:248, BSD-3C:259, BSD-4C:62,
BSD-4C:137, BSD-4C:154, BSD-4C:164, BSD-V:3,
NOTE:4, NOTE:5, NOTE:15
-----------------------------------------------------------------------------
usr/sbin/random BSD-3C:4, BSD-3C:214, BSD-4C:61, BSD-4C:154,
NOTE:4, NOTE:5, NOTE:15, ZLIB:18, ZLIB:55
---------------------------------------------------------------------------
usr/bin/vcapture-test BSD-4C:154, NOTE:2, NOTE:4, NOTE:5, NOTE:15
-----------------------------------------------------------------------------------
The License_details column follows the pattern license_id:copyright_id, with entries separated by commas.
Each license_id refers to the full notice text in another table, and the copyright id (the integer after the colon) is an index into another large table with copyright info, author, acknowledgement details, etc.
The question is: how do I design a query to get both the license id and the copyright id, given that they are quite random, not serial?
One way is to manually inspect every id and design a query for every row of data, but that takes far too much effort since we have more than 300 pages of data.
Another way is to put that data in Excel and parse it with a Perl script, so that every row of the table is run through the script to produce the list of license ids and copyright ids; a rough sketch of that parsing step is shown below.
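For illustration, here is a minimal parsing sketch in Python rather than Perl; the function name is made up, and it assumes every comma-separated token has the form LICENSE-ID:INTEGER as in the sample rows above:

# Minimal sketch: split one License_details cell into (license_id, copyright_id) pairs.
# Assumes each comma-separated token looks like "BSD-2C:63".
def parse_license_details(cell):
    pairs = []
    for token in cell.split(","):
        token = token.strip()
        if not token:
            continue
        license_id, copyright_id = token.rsplit(":", 1)
        pairs.append((license_id, int(copyright_id)))
    return pairs

print(parse_license_details("BSD-2C:63, BSD-2C:1576, NOTE:4"))
# [('BSD-2C', 63), ('BSD-2C', 1576), ('NOTE', 4)]

Each pair could then be turned into a lookup against the license table and the copyright table.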
Any help will be greatly appreciated.
I would like to restructure my long-format SPSS file so I can clean it and get a better overview. However, I run into some problems.
Patients appear several times in the database (column PatientID). How can I make a new variable that contains the patient ID only once per patient, preferably on the line with the baseline data, i.e. the first moment the questionnaires were completed?
I have consulted my colleagues, but without concrete solutions or answers.
This can be done with the lag function after sorting the file, so that newvar is filled only on the first row of each patient:
sort cases by PatientID_Pseudo OpenInvulMomenten.
if $casenum=1 or ($casenum>1 and PatientID_Pseudo<>lag(PatientID_Pseudo)) newvar=PatientID_Pseudo.
exe.
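That is the SPSS solution. Purely to illustrate the same lag logic, here is a rough pandas sketch; only the column names come from the question, and the example data is made up:

# Not SPSS: a pandas sketch of the same idea, using made-up example data.
import pandas as pd

df = pd.DataFrame({
    "PatientID_Pseudo": ["P1", "P1", "P2", "P2", "P2"],
    "OpenInvulMomenten": [1, 2, 1, 2, 3],
})

df = df.sort_values(["PatientID_Pseudo", "OpenInvulMomenten"])
# Keep the patient ID only on the first (baseline) row of each patient,
# i.e. where the ID differs from the previous row (the "lag").
first_row = df["PatientID_Pseudo"] != df["PatientID_Pseudo"].shift()
df["newvar"] = df["PatientID_Pseudo"].where(first_row)
print(df)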
Below is the code I have now. It pulls the job-base-cost just fine; however, I cannot get it to pull the ID and/or name of the item. Can you help?
Link to the site's XML pull.
=importxml("link","//job-base-cost")
This is a sample of one line of the OP's XML file
<job-base-cost id="24693" name="Abaddon Blueprint">109555912.69</job-base-cost>
The OP wants to use the IMPORTXML function to report the ID and Name as well as the Job Cost from the XML data. Presently, the OP's formula is:
=importxml("link","//job-base-cost")
There are two options:
1 - One long column
=importxml("link","//@id | //@name | //job-base-cost")
Note //@id and //@name in the xpath query: // selects nodes anywhere in the document (at any level, not just the root level) and @ selects attributes. The pipe | is the union operator, so the plain-English query is: display the id, the name and the job-base-cost.
2 - Three columns (table format)
={IMPORTXML("link","//@name"),IMPORTXML("link","//job-base-cost"),IMPORTXML("link","//@id")}
This builds an array that displays each of the three fields in its own column.
Note: there is an arrayformula that uses a single importXML function described in How do I return multiple columns of data using ImportXML in Google Spreadsheets?. Readers may want to look at whether that option can be implemented.
My thanks to @Tanaike for his comment, which spurred me to look at how xpath works.
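For readers who want to see how the union XPath behaves outside of Sheets, here is a minimal Python sketch with lxml, applied to the single sample line from the question (note that in Sheets //job-base-cost already returns the element text, whereas lxml needs an explicit text()):

# Minimal sketch: the same XPath union evaluated with lxml on the sample element.
from lxml import etree

xml = b'<job-base-cost id="24693" name="Abaddon Blueprint">109555912.69</job-base-cost>'
root = etree.fromstring(xml)

# Union of the id attribute, the name attribute, and the element's text content.
print(root.xpath("//@id | //@name | //job-base-cost/text()"))
# Expected: ['24693', 'Abaddon Blueprint', '109555912.69']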
I know how to parse JSON cells in OpenRefine, but this one is too tricky for me.
I've used an API to extract the calendar of 4730 Airbnb rooms, identified by their IDs.
Here is an example of one JSON file: https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until November 2017, I would like to extract the availability of the room (true or false) and its price on that day.
I can't figure out how to parse out this information. I guess it implies a series of nested forEach, but I can't find the right way to do this with OpenRefine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionaries, which confuses me.
Any help would be appreciated. If the operation is too difficult in OpenRefine, a solution with R (or Python) would also be fine for me.
Rather than just creating your project as text and using GREL to parse it out...
The best way is to simply select the JSON record part that you want to work with using the visual importer wizard for JSON and XML files (you can even use a URL pointing to a JSON file, as in your example). A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw
Select the JSON part that contains the records you want to parse and work with (this can be any repeating part; just select one of them and OpenRefine will extract all the rest).
Limit the number of data rows to load during creation, or leave the default of all rows.
Click Create Project, and you are now in Rows mode. However, if you think Records mode might be better suited for your context, just import the project again as JSON and select the next outer area of the content, perhaps a larger array that contains a key field. In the example, the key field would probably be the date, which is why I would highlight the whole record for a given date. This way OpenRefine will have keys for each record, and Records mode lets you work with them better than Rows mode.
Feel free to take this example, make it better and even more helpful for all, and add it to our Wiki section on How to Use.
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OpenRefine array containing twelve items (one for each month of the year). The items in the array are JSON, each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OpenRefine can't store arrays directly in a cell; it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OpenRefine.
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from the example URL above, this gives me 441 rows for the single ID, each containing the JSON describing the availability and price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.
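Since the OP mentions that a Python solution would also be fine, here is a minimal sketch of that alternative. The calendar_months, days and available names come from the question, but the file name and the date/price keys are assumptions that should be checked against the actual API response:

# Minimal sketch: flatten the calendar JSON into one record per day.
import json

# Hypothetical local copy of the API response for one listing.
with open("calendar_4212133.json") as f:
    data = json.load(f)

rows = []
for month in data["calendar_months"]:
    for day in month["days"]:
        rows.append({
            "date": day.get("date"),            # assumed key
            "available": day.get("available"),
            "price": day.get("price"),          # assumed key; may be nested differently
        })

print(rows[:3])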
I have a MySQL (5.6) database on my local workstation into which I routinely pull large datasets to perform analysis on. I have a separate SQL script for each dataset that imports the data and reformats it when needed (notably to convert date formats). In addition, I have other scripts that perform detailed analysis on the data.
For quality assurance, I would like to have a table named ImportLog that stores a record to capture the result of each import that is run. This table would look like the following:
ImportName DateRun RowsImported
---------- ------- ------------
ImportASR 2015-08-29 12902
ImportEAD 2015-08-30 18023
ImportHRData 2015-08-30 122376
The column definitions for ImportLog are as follows:
ImportName // the name of the script that is run
DateRun // the date that the script is run
RowsImported // the count of records imported in the run.
At the very end of each script would be the code to write one line to this table with the relevant data. For example, let's say that I ran the script named ImportASR on 8/29/2015 and it imported 12,902 records. At the end of the script, I want to append one record to ImportLog (like the first record in the table above) using something like this:
INSERT INTO ImportLog
VALUES("ImportASR", $DateRun, $RowCount);
Every time I run one of the import scripts, it would add a row to the ImportLog table with the appropriate data.
My question is: How do I populate the $DateRun variable with the current date and the $RowCount variable with the row count of the newly imported ASR dataset? Or am I trying to approach this from the wrong angle?
First thing this morning I stumbled upon the answer to my problem; it was amazingly simple, and to my surprise it didn't require any variables at all. The code to put at the end of each import script is something like:
INSERT INTO ImportLog (ImportName, DateRun, RowCount)
SELECT 'ImportASR', NOW(), COUNT(*)
FROM ASR_Full;
The ImportLog table is initially defined like so:
CREATE TABLE ImportLog (
ImportName VARCHAR(25),
DateRun DATETIME,
RowCount INT
);
Hope this helps someone else!
I frequently need to pull some CSV reports and analyze them using Power Pivot. The "issue" is that the tool spits out the report like this:
Report Name Keywords (Group contains 778600, Campaign contains us-en)
Client XYZ
Scope Entire Account
Date Range 3/12/2015
Filters Campaign contains us-en; Group contains 778600; Clicks > 0; Reduced Dimension
Keyword Account Publisher Campaign Group Search Bid $ Status Destination URL
Total for all 2 keywords
Keyword Account Publisher Campaign Group Search Bid $ Status Destination URL
bla bla bla Account Name Publisher Name Campaign Name Group Name 1 Active URL
So what I always need to do is remove the first 9 rows of the CSV prior to importing. Usually I can do this in Notepad++, but sometimes the CSV is so large that I can't really open it to edit. So far I have been using a program called 010 Editor, but I only have a few days left on it.
Is there an easy way to skip those rows when importing?
Thanks a lot
You can use Power Query (free to download) to load data into Power Pivot. It allows you to skip the first x rows and filter out rows with blank/null values. Once you get this working, you can copy the M code to reuse it on other CSVs, or you can wrap it in a function and just feed it file locations.
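If Power Query is not an option, a small script can also strip the leading rows without opening the file in an editor. Here is a minimal Python sketch; the file names are placeholders, and the number of rows to skip should be adjusted to match the report:

# Minimal sketch: stream a large CSV and drop the first 9 report-header rows.
SKIP_ROWS = 9

with open("report.csv", "r", encoding="utf-8") as src, \
     open("report_clean.csv", "w", encoding="utf-8") as dst:
    for line_number, line in enumerate(src):
        if line_number < SKIP_ROWS:
            continue
        dst.write(line)

Because the file is streamed line by line, this works even on CSVs that are too large to open in a text editor.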