I am new to Prolog and am considering using it for a small data analysis application. Here is what I am seeking to accomplish:
I have a CSV file with data of the following form:
a,b,c
d,e,f
g,h,i
...
The data is purely numerical, and I first need to group rows according to the following scheme:
I start at the first row, which has value 'a' in column one. I then keep going down the rows until I hit a row whose value in column one differs from 'a' by a certain amount, 'z'. The process is then repeated from that row, and many "groups" are formed once it is complete.
For each of these groups, I want to find the mean of columns two and three (for example, if the first group consists of the three rows shown above, the mean of column two would be (b+e+h)/3).
I am pretty sure this can be done in Prolog. However, I have 50,000+ rows of data, and since Prolog is declarative, I am not sure how efficient it would be at this task.
Is it feasible to work out a Prolog program for the above, such that its efficiency is not significantly lower than that of a procedural analog?
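Since the question asks how a Prolog solution compares with a procedural analog, here is a minimal procedural sketch of the task in Python (a hypothetical baseline, not from the original post; it assumes a group closes as soon as a row's first column differs from the group's starting value by at least z, and that the closing row starts the next group):

```python
# Hypothetical procedural baseline for the grouping-and-averaging task.
# Assumption: a group closes once a row's first column differs from the
# group's starting value by at least z; that row then starts the next group.

def group_means(rows, z):
    """rows: list of (c1, c2, c3) numeric tuples; returns (mean2, mean3) per group."""
    means = []
    group = []
    for row in rows:
        if group and abs(row[0] - group[0][0]) >= z:
            n = len(group)
            means.append((sum(r[1] for r in group) / n,
                          sum(r[2] for r in group) / n))
            group = []
        group.append(row)
    if group:  # close the final group at end of input
        n = len(group)
        means.append((sum(r[1] for r in group) / n,
                      sum(r[2] for r in group) / n))
    return means

# group_means([(1,2,3), (4,5,6), (7,8,9)], 6)  ->  [(3.5, 4.5), (8.0, 9.0)]
```

Whether the boundary row belongs to the old or the new group, and whether the test is "differs by exactly z" or "by at least z", are modelling choices the question leaves open; the sketch picks one interpretation.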
This snippet could be a starting point for your task:
:- use_module(library(dcg/basics)).
rownum(Z, AveList) :- phrase_from_file(row_scan(Z, [], [], AveList), 'numbers.txt').
row_scan(Z, Group, AveSoFar, AveList) -->
number(A),",",number(B),",",number(C),"\n",
{ row_match(Z, A,B,C, Group,AveSoFar, Group1,AveUpdated) },
row_scan(Z, Group1, AveUpdated, AveList).
row_scan(_Z, _Group, AveList, AveList) --> "\n";[].  % end of input; note an unfinished final group is discarded
% row_match(Z, A,B,C, Group,Ave, Group1,Ave1)
row_match(_, A,B,C, [],Ave, [(A,B,C)],Ave).  % empty group: this row starts a new group
row_match(Z, A,B,C, [H|T],Ave, Group1,Ave1) :-
H = (F,_,_),
( A - F =:= Z
-> aggregate_all(agg(count,sum(C2),sum(C3)),
member((_,C2,C3), [(A,B,C), H|T]), agg(Count,T2,T3)),
A2 is T2/Count, A3 is T3/Count,
Group1 = [], Ave1 = [(A2,A3)|Ave]
; Group1 = [H,(A,B,C)|T], Ave1 = Ave
).
With this input:
1,2,3
4,5,6
7,8,9
10,2,3
40,5,6
70,8,9
16,0,0
it yields:
?- rownum(6,L).
L = [ (3.75, 4.5), (5, 6)]
Good evening,
I am trying to read a csv file in Prolog containing all the countries in the world. Executing this code:
read_KB(R) :- csv_read_file("countries.csv",R).
I get a list of Terms of this type:
R = [row('Afghanistan;'), row('Albania;'), row('Algeria;'), row('Andorra;'), row('Angola;'), row('Antigua and Barbuda;'), row('Argentina;'), row('Armenia;'), row(...)|...].
I would like to extract only the names of each country in form of a String and put all of them into a list of Strings.
I tried to extract just the first row by executing this:
read_KB(L) :- csv_read_file("/Users/dylan/Desktop/country.csv",R),
give(R,L).
give([X|T],X).
but I obtain only a single term of the form row('Afghanistan;').
You can use maplist/3:
read_KB(Names) :-
csv_read_file('countries.csv', Rows, [separator(0';)]),
maplist([row(Name,_), Name] >> true, Rows, Names).
The answer given by @slago can be simplified by using arg/3 instead of a lambda expression, making it slightly more efficient:
read_KB(Names) :-
csv_read_file('countries.csv', Rows, [separator(0';)]),
maplist(arg(1), Rows, Names).
I am trying to create a relationship between two different graphs, using information in a CSV file. I built the query the way I did because of the size of each graph, one being 500k+ nodes and the other 1.5m+.
This is the query I have:
LOAD CSV WITH HEADERS FROM "file:///customers_table.csv" AS row WITH row
MATCH (m:Main) WITH m
MATCH (c:Customers) USING INDEX c:Customers(customer)
WHERE m.ASIN = row.asin AND c.customer = row.customer
CREATE (c)-[:RATED]->(m)
This is the error I receive:
Variable `row` not defined (line 4, column 16 (offset: 164))
"WHERE m.ASIN = row.asin AND c.customer = row.customer"
^
An example of the Main table is:
{
"ASIN": "0827229534",
"totalreviews": "2",
"categories": "2",
"title": "Patterns of Preaching: A Sermon Sampler",
"avgrating": "5",
"group": "Book"
}
And an example of a customer is:
{
"customer": "A2FMUVHRO76A32"
}
And inside the customers table csv, I have:
Customer, ASIN, rating
A2FMUVHRO76A32, 0827229534, 5
I can't seem to figure out why it's throwing back that error.
The first WITH clause in your query (WITH row) is unnecessary, but the second one (WITH m) limits the scope to m alone, so you have to add row to that WITH clause. This version compiles:
LOAD CSV WITH HEADERS FROM "file:///customers_table.csv" AS row
MATCH (m:Main)
WITH m, row
MATCH (c:Customers) USING INDEX c:Customers(customer)
WHERE m.ASIN = row.asin AND c.customer = row.customer
CREATE (c)-[:RATED]->(m)
The reason is that, in essence, WITH chains two query parts together while limiting the scope to its variables (and in some cases also performing calculations, aggregations, etc.).
Having said that, you do not even need the second WITH clause; you can simply omit it and even merge the two MATCH clauses into one:
LOAD CSV WITH HEADERS FROM "file:///customers_table.csv" AS row
MATCH (m:Main), (c:Customers) USING INDEX c:Customers(customer)
WHERE m.ASIN = row.asin AND c.customer = row.customer
CREATE (c)-[:RATED]->(m)
I am trying to extract the information returned by the nodeapply() function (via info_node). I want to automate the process so that I can extract the information for a list of node ids and operate on them later.
Here is the example:
library(partykit)
data("cars", package = "datasets")
ct <- ctree(dist ~ speed, data = cars)
node5 <- nodeapply(as.simpleparty(ct), ids = 5, info_node)
node5$`5`$n
I use the code above to extract the number of records on node 5.
I want to create a function to extract the info from a series of nodes:
infonode <- function(x,y){
for (j in x){
info = nodeapply(y, j, info_node)
print(info$`j`$n)
}
}
But the result always comes back as NULL.
I wonder if the way "j" is used within the function is wrong, leading to the NULL in the print.
If someone could help me it would be greatly appreciated!
Thanks
You can give nodeapply() a list of ids, and then not just a single-element list but a list of all selected nodes will be extracted. This is the only partykit-specific part of your question.
From that point forward it is simply a matter of operating on standard named lists in R; nothing about it is partykit-specific. To address your problem, use [[ indexing rather than $ indexing, with either an integer or a character index:
node5[[1]]$n
## n
## 19
node5[["5"]]$n
## n
## 19
Thus, in your infonode() function you could replace info$`j`$n by either info[[1]]$n or info[[as.character(j)]]$n.
However, I would simply do this with an sapply():
ni <- nodeapply(as.simpleparty(ct), ids = 3:5, info_node)
sapply(ni, "[[", "n")
## 3.n 4.n 5.n
## 15 16 19
Or some variation of this...
I am retrieving data from a MySQL db. All the data is in one column, and I need to separate it into several columns. The structure of this column is as follows:
{{product ID=001 |Country=Netherlands |Repository Link=http://googt.com |Other Relevant Information=test }} ==Description== this are the below codes: code 1 code2 ==Case Study== case study 1 txt case study 2 txt ==Benefits== ben 1 ben 2 === Requirements === (empty col) === Architecture === *arch1 *arch2
So I want columns like Product ID, Country, Repository Link, Architecture, etc.
If you are simply planning to parse the output of your column, the details will depend on the language you are using.
In general, however, the procedure is as follows:
1. Pull the output into a string.
2. Find a delimiter (in your case it appears '|' will do).
3. You then have two options (again depending on language):
   A. Split each segment into an array, then run the array through a looping structure to print out each section, or use the array to manipulate the data individually (your choice).
   B. With a simple string method, either create a new string or replace all instances of '|' with '\n' (the newline character) so that you can display all the data.
I recommend the array conversion as this will allow you to easily interact with the data in a simple manner.
This is often done today with JSON and other such formats, which are frequently stored in single fields for various reasons.
Here is an example in PHP making use of explode():
$unparsed = "this | is | a | string that is | not: parsed";
$parsed = explode("|", $unparsed);
echo $parsed[2]; // " a "  (note the surrounding spaces are kept)
echo $parsed[4]; // " not: parsed"
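For the specific column layout in the question (a {{...}} block of key=value pairs followed by ==Section== headers), a delimiter split alone will not yield named columns. Here is a hypothetical sketch of a fuller parse, shown in Python for illustration; the function name and regular expressions are assumptions, not part of the original answer:

```python
import re

# Hypothetical sketch: parse the template-style column into named fields.
# Assumes "key=value" pairs separated by '|' inside {{ ... }}, followed by
# "==Section==" headers whose bodies run until the next header.

def parse_column(text):
    fields = {}
    # key=value pairs inside the {{ ... }} block
    m = re.search(r"\{\{(.*?)\}\}", text, re.S)
    if m:
        for part in m.group(1).split("|"):
            if "=" in part:
                key, _, value = part.partition("=")
                fields[key.strip()] = value.strip()
    # ==Section== headers and the text following each of them
    rest = text[m.end():] if m else text
    pieces = re.split(r"=+\s*([^=]+?)\s*=+", rest)
    for name, body in zip(pieces[1::2], pieces[2::2]):
        fields[name] = body.strip()
    return fields
```

The same two-stage idea (split the template block on '|', then split the remainder on section headers) carries over directly to PHP with preg_match() and preg_split().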
I am reading in a data table from a CSV file. Some elements in the CSV are in JSON format, so one of the columns has JSON formatted data, for example:
user_id tv_sec action_info
1: 47074 1426791420 {"foo": {"bar":12345,"baz":309}, "type": "type1"}
2: 47074 1426791658 {"foo": {"bar":23409,"baz":903}, "type": "type2"}
3: 47074 1426791923 {"foo": {"bar":97241,"baz":218}, "type": "type3"}
I would like to flatten out the action_info column and add the data as columns, as follows:
user_id tv_sec bar baz type
1: 47074 1426791420 12345 309 type1
2: 47074 1426791658 23409 903 type2
3: 47074 1426791923 97241 218 type3
I am not sure how to achieve this. I found a library for parsing JSON strings in R (RJSONIO), but I am having a hard time figuring out what to do next. When I experiment with converting all rows of the action_info column from JSON with the command userActions[,.(fromJSON(action_info))], I basically get a data table in which all the values seem to be accumulated in a way that is not entirely clear to me. For example, running that on my (non-example) data I get:
V1
1: 2.188603e+12,2.187628e+12,2.186202e+12,1.164000e+03
2: type1
Warning messages:
1: In if (is.na(encoding)) return(0L) :
the condition has length > 1 and only the first element will be used
2: In if (is.na(i)) { :
the condition has length > 1 and only the first element will be used
So, I'm trying to figure out:
how to operate on the column to convert it from JSON to values (I think I am doing this correctly, but I'm not certain), and
how to take the resulting values and create columns out of them in either the current or a new data table.
Rather ugly but should work:
library(dplyr)
library(data.table)
lapply(as.character(df$action_info), RJSONIO::fromJSON) %>%
lapply(function(e) list(bar=e$foo[1], baz=e$foo[2], type=e$type)) %>%
rbindlist() %>%
cbind(df) %>%
select(-action_info)
Data:
library(data.table)
df <- data.table(structure(list(user_id = c(47074L, 47074L, 47074L), tv_sec = c(1426791420L,
1426791658L, 1426791923L), action_info = c("{\"foo\": {\"bar\":12345,\"baz\":309}, \"type\": \"type1\"}",
"{\"foo\": {\"bar\":23409,\"baz\":903}, \"type\": \"type2\"}",
"{\"foo\": {\"bar\":97241,\"baz\":218}, \"type\": \"type3\"}"
)), .Names = c("user_id", "tv_sec", "action_info"), row.names = c(NA,
-3L), class = "data.frame"))
Here's one way to do it with data.table:
df[, c('bar', 'baz', 'type'):=as.list(unlist(fromJSON(action_info[1]))),
by=action_info]
How it works:
The by=action_info essentially ensures we call fromJSON only once per unique action_info (once per row in your case); this is needed because fromJSON does not work on vectorised input.
The fromJSON(action_info[1]) parses the action_info JSON string into an R object (the [1] is there on the off chance that you have multiple rows with the same action_info, since fromJSON does not accept vector input).
The unlist flattens the nested "foo: {bar...}" (do fromJSON(df$action_info[1]) and unlist(fromJSON(df$action_info[1])) to see what I mean).
The as.list converts the result back into a list, with one element per "column" (data.table needs this for the multiple assignment).
Then the c('bar', 'baz', 'type'):= assigns the output back out to the columns.
Note we don't match by name, so 'bar' is always the first part of the JSON, 'baz' always the second, and so on. If your action_info could contain a {bar: ..., baz: ...} as well as a {baz: ..., bar: ...}, the baz of the second will be assigned to the bar column. If you want to assign by name, you will have to do something cleverer (for example, you could use as.list(...)[c('foo.bar', 'foo.baz', 'type')] to ensure the elements are in the right order before assigning).