Most "popular" Python repos on GitHub - json

Based on the v3 documentation I would have thought that this:
$ curl https://api.github.com/legacy/repos/search/python?language=Python&sort=forks&order=desc
would return the top 100 Python repositories in descending order of fork count. It actually returns an empty (JSON) list of repositories.
This:
$ curl https://api.github.com/legacy/repos/search/python?language=Python&sort=forks
returns a list of repositories (in JSON), but many of them are not listed as Python repositories.
So, clearly I have misunderstood the GitHub API. What is the accepted way of retrieving the top N repositories for a particular language?

As pengwynn said -- currently this is not easily doable via GitHub's API alone. However, have a look at this alternative way of querying using the GitHub Archive project: How to find the 100 largest GitHub repositories for a past date?
In essence, you can query GitHub's historical data using an SQL-like language. So, if having real-time results is not important to you, you could execute the following query on https://bigquery.cloud.google.com/?pli=1 to get the top 100 Python repos as of April 1st, 2013 (yesterday), descending by the number of forks:
SELECT MAX(repository_forks) as forks, repository_url
FROM [githubarchive:github.timeline]
WHERE (created_at CONTAINS "2013-04-01" and repository_language = "Python")
GROUP BY repository_url
ORDER BY forks DESC
LIMIT 100
I've put the results of the query in this Gist in CSV format, and the top few repos are:
forks repository_url
1913 https://github.com/django/django
1100 https://github.com/facebook/tornado
994 https://github.com/mitsuhiko/flask
...
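If you prefer to run this programmatically rather than through the BigQuery web console, a minimal Python sketch with the google-cloud-bigquery client might look like the following (the client setup, credentials, and the continued availability of the githubarchive timeline table are assumptions):
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key.
client = bigquery.Client()

sql = """
SELECT MAX(repository_forks) AS forks, repository_url
FROM [githubarchive:github.timeline]
WHERE created_at CONTAINS "2013-04-01" AND repository_language = "Python"
GROUP BY repository_url
ORDER BY forks DESC
LIMIT 100
"""

# The table reference above uses legacy SQL syntax, so request legacy SQL explicitly.
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(sql, job_config=job_config).result():
    print(row.forks, row.repository_url)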

The intent of the Repository Search API is to find repositories by keyword and then filter those results further with the other optional query string parameters.
Since you're missing a ?, you're passing the entire intended query string as the :keyword. I'm sorry, we do not support your intended search via the GitHub API at this time.

Try this:
https://api.github.com/search/repositories?q=language:Python&sort=forks&order=desc
It searches over repositories using the newer Search API rather than the legacy endpoint.
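For completeness, a minimal Python sketch of calling this endpoint with the requests library (the per_page value is an assumption, and unauthenticated requests are heavily rate limited):
import requests

resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": "language:Python", "sort": "forks", "order": "desc", "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
)
resp.raise_for_status()

# Each item is a repository; print fork count and full name.
for repo in resp.json()["items"]:
    print(repo["forks_count"], repo["full_name"])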

Related

Azure Datafactory process and filter files to process

I have a pipeline that processes some files, and in some cases "groups" of files, meaning the files should be processed together and are correlated by a timestamp.
Ex.
Timestamp#Customer.csv
Timestamp#Customer_Offices.csv
Timestamp_1#Customer.csv
Timestamp_1#Customer_Offices.csv
...
I have a table with all the scopes and files with their respective file masks, and I populate a variable at the beginning of the pipeline based on a parameter.
The Get files activity goes to an sFTP location and grabs files from a folder. I then only want to process the "Customer.csv" and "Customer_Offices.csv" files, because the folder location has more file types or scopes to be processed by other pipelines. If I don't filter, the next activities end up processing metadata of files they are not supposed to. In terms of efficiency and performance that's bad, and it is even causing some issues with files being left behind.
I've tried something like
@variables('FilesToSearch').contains(@endswith(item().name, 'do I need this 2nd parm in arrays ?'))
but no luck... :(
Any help will be highly appreciated,
Best regards,
Manuel
The contains function can check a string for a substring, so you can try an expression like @contains(item().name,'Customer'), with no need to create a variable.
Or use the endsWith function with this expression:
@or(endswith(item().name,'Customer.csv'),endswith(item().name,'Customer_Offices.csv'))

Find all possible paths between two nodes on graph using a graph database

I have a collection of nodes that make up a DAG (directed acyclic graph), guaranteed to have no loops. I want to store the nodes in a database and have the database execute a search that shows me all paths between two nodes.
For example, you could think that I have the git history of a complex project.
Each node can be described with a JSON object that has:
{'id':'id',
 'outbound':['id1','id2','id3']}
So if I had these nodes in the database:
{'id':'id0',
 'outbound':['id1','id2']}
{'id':'id1',
 'outbound':['id2','id3','id4','id5','id6']}
{'id':'id2',
 'outbound':['id2','id3']}
And if I wanted to know all of the paths connecting id0 and id3, I would want to get three lists:
id0 -> id1 -> id3
id0 -> id2 -> id3
id0 -> id1 -> id2 -> id3
I have thousands of these nodes today, I will have tens of thousands of them tomorrow. However, there are many DAGs in the database, and the typical DAG only has 5-10 nodes, so this problem is tractable.
I believe that there is no way to do this efficiently in MySQL (right now all of the objects are stored in a table in a JSON column); however, I believe it is possible to do it efficiently in a graph database like Neo4j.
I've looked at the Neo4J documentation on Path Finding Algorithms and perhaps I'm confused, but the examples don't really look like working examples. I found a MySQL example which uses stored procedures and it doesn't look like it parallelizes very well. I'm not even sure what Amazon Neptune is doing; I think that it is using Spark GraphX.
I'm sort of lost as to where to start on this.
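(For illustration only: the enumeration being asked about is easy to sketch in plain Python over the small in-memory example above. This is just to clarify the expected output, not the database-side solution the question is after.)
# Nodes keyed by id, each mapping to its outbound edge list (same data as above).
graph = {
    'id0': ['id1', 'id2'],
    'id1': ['id2', 'id3', 'id4', 'id5', 'id6'],
    'id2': ['id2', 'id3'],
}

def all_paths(graph, start, goal, path=None):
    """Depth-first enumeration of every path from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt in path:  # skip self-loops / cycles defensively
            continue
        paths.extend(all_paths(graph, nxt, goal, path))
    return paths

for p in all_paths(graph, 'id0', 'id3'):
    print(' -> '.join(p))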
It's perfectly doable with Neo4j.
Importing json data
[
{"id":"id0",
"outbound":["id1","id2"]
},
{"id":"id1",
"outbound":["id2","id3","id4","id5","id6"]
},
{"id":"id2",
"outbound":["id2","id3"]
}
]
CALL apoc.load.json("graph.json")
YIELD value
MERGE (n:Node {id: value.id})
WITH n, value.outbound AS outbound
UNWIND outbound AS o
MERGE (n2:Node {id: o})
MERGE (n)-[:Edge]->(n2)
Apparently the data you provided is not acyclic...
Getting all paths between two nodes
As you are not mentioning shortest paths, but all paths, there is no specific algorithm required:
MATCH p=(:Node {id: "id0"})-[:Edge*]->(:Node {id: "id3"}) RETURN nodes(p)
"[{""id"":id0},{""id"":id1},{""id"":id3}]"
"[{""id"":id0},{""id"":id2},{""id"":id3}]"
"[{""id"":id0},{""id"":id1},{""id"":id2},{""id"":id3}]"
"[{""id"":id0},{""id"":id2},{""id"":id2},{""id"":id3}]"
"[{""id"":id0},{""id"":id1},{""id"":id2},{""id"":id2},{""id"":id3}]"
Comparison with MySQL
See how-much-faster-is-a-graph-database-really
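If you want to run the all-paths query from Python instead of the Neo4j browser, a minimal sketch with the official neo4j driver could look like this (the bolt URI and credentials are assumptions):
from neo4j import GraphDatabase  # pip install neo4j

# Assumed connection details; adjust for your own instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH p = (:Node {id: $start})-[:Edge*]->(:Node {id: $end})
RETURN [n IN nodes(p) | n.id] AS path
"""

with driver.session() as session:
    for record in session.run(query, start="id0", end="id3"):
        print(" -> ".join(record["path"]))

driver.close()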
The Graph Data Science library pathfinding algorithms are designed to find the shortest weighted paths and use algorithms similar to Dijkstra to find them. In your case, it seems that you are dealing with a directed unweighted graph, so you could use the native Cypher allShortestPaths function:
An example would be:
MATCH (n1:Node{id:"A"}),(n2:Node{id:"B"})
MATCH path=allShortestPaths((n1)-[*..10]->(n2))
RETURN [n in nodes(path) | n.id] as outbound_nodes_id
It is always useful to check the Cypher refcard to see what is available with Cypher in Neo4j

Using cts query to retrieve collections associated with the given document uri- Marklogic

I need to retrieve the collections to which a given document belongs in MarkLogic.
I know an xdmp function does that, but I need to use it in a cts query to retrieve the data and then filter records from it.
xdmp:document-get-collections("uri of document") can't be run inside a cts query to give the appropriate data.
Any idea how it can be done using a cts query?
Thanks
A few options come to mind:
Option One: Use cts:values()
cts:values(cts:collection-reference())
If you check out the documentation, you will see that you can also restrict this to certain fragments by passing a query as one of the parameters.
Update [11-10-2017]:
The comment attached to this asked for a sample of restricting the results of cts:values() to a single document (for practical purposes, I will say fragment == document).
The documentation for cts:values explains this. It is the 4th parameter - a query to restrict the results. Get to know this pattern, as it is part of many features of MarkLogic. It is your friend. The query I would use for this problem statement is a cts:document-query().
An Example:
cts:values(
  cts:collection-reference(),
  (),
  (),
  cts:document-query('/path/to/my/document')
)
Full Example:
cts:search(
  collection(),
  cts:collection-query(
    cts:values(
      cts:collection-reference(),
      (),
      (),
      cts:document-query('/path/to/my/document')
    )
  )
)[1 to 10]
Option Two: Use cts:collection-match()
If you need more control over returning just some of the collections from a document, then use cts:collection-match(). Like the first option, you can restrict the results to just some fragments. However, it has the added benefit of accepting a pattern.
Attention:
They both return a sequence - perfect for feeding into other parts of your query. However, under the hood, I believe they work differently. The second option is run against a lexicon. The larger the list of unique collection names and the more complex your pattern match, the longer resolution takes. I use collection-match in projects; however, I usually use it when I can limit the possible choices by restricting the results to a smaller number of documents.
You can't do this in a single step. You have to run code first to retrieve collections associated with a document. You can use something like xdmp:document-get-collections for that. You then have to feed that into a cts query that you build dynamically:
let $doc-collections := xdmp:document-get-collections($doc-uri)
return
cts:search(collection(), cts:collection-query($doc-collections))[1 to 10]
HTH!
Are you looking for cts:collection-query()?
Insert two XML files to the same collection:
xquery version "1.0-ml";
xdmp:document-insert("/a.xml", <root><sub1><a>aaa</a></sub1></root>,
map:map() => map:with("collections", ("coll1")));
xdmp:document-insert("/b.xml", <root><sub2><a>aaa</a></sub2></root>,
map:map() => map:with("collections", ("coll1")));
Search the collection:
xquery version "1.0-ml";
let $myColl:= xdmp:document-get-collections("/a.xml")
return
cts:search(/root,
  cts:and-query((
    cts:collection-query($myColl),
    cts:element-query(xs:QName("a"), "aaa")
  )))

How to get extracts of revisions of wikipedia articles

I would like to get extracts for old versions (revisions) of Wikipedia articles. This question shows how to get the content in JSON format. In particular, it uses prop=extracts with explaintext to return the content. I would like to do the same but for a revision (using revid=*) of an article.
This is not possible using only TextExtracts. See RFE T66546 Support revisions, which was declined. In fact, if you try to specify revids instead of titles, then TextExtracts will return the extract(s) for the current revision(s) of the corresponding title(s). For example:
action=query&prop=extracts&exchars=100&explaintext&revids=342428310
has equivalent output to:
action=query&prop=extracts&exchars=100&explaintext&titles=Earth
You would need to figure out how TextExtracts prepares an extract and do the same for the revision of interest.
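One possible workaround is to fetch the raw wikitext of the specific revision with prop=revisions and then prepare a plain-text extract yourself. A minimal Python sketch of the fetching part follows (the revision id is just an example, and the extract-preparation step is left as a stub):
import requests

# Example revision id for illustration; swap in the revision you care about.
REV_ID = 342428310

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "revids": REV_ID,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": 2,
    },
)
resp.raise_for_status()

page = resp.json()["query"]["pages"][0]
wikitext = page["revisions"][0]["slots"]["main"]["content"]

# TODO: turn the wikitext into a plain-text extract (this is the part
# TextExtracts handles for current revisions; you would replicate it yourself).
print(wikitext[:500])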

Drive API files.list query with 'not' parameter returns empty pages

I'm using the Drive API to list files from a collection which do not contain a certain string in their title.
My query looks something like this:
files().list(q="'xxxxx' in parents and not title contains 'toto'")
In my drive collection, I have 100 files, all contain the string "toto" in their title except for let's say 10 files.
I'm using pagination to retrieve the results 20 by 20, so I'm expecting to get only one page with the 10 files corresponding to my request. Surprisingly, the API returns 5 pages: the first 4 have no results but do have a nextPageToken, and the files matching my request only come on the fifth page.
I'm still trying some use cases here, but it seems to have something to do with the "not" operator. It is as if the request were made without it (hence the 5 pages), and the results not matching the request were then removed from the response. It's very disturbing, as I'm looking for the best performance here, and obviously having to make 5 requests to Drive instead of a single one is not good. I'm also noticing that the results don't always come in the last page. I made the test with another collection: the results show up in the second page, but I still get 3 empty pages after that.
Am I missing something here? Is this kind of behaviour "normal"? I mean, imagine if I had 1000 documents in my collection; having to make 50 requests to find only a few is not what I expect.
I had a similar problem with the files.list API. I tried to retrieve all three folders under the root folder and only received a result on the 342nd page. After several hours of research I found some regularity in this strange behavior.
As I understand it, the Drive API works this way:
It detects something like an index that best matches your query.
It selects the first 20 records using the index from step 1.
It applies your filter: records that do not match your query are removed.
The rest is returned to you (maybe empty) with a next page token.
The nextPageToken looks like just an OFFSET to the first record on the next page in the chosen index; maybe it also contains some information about the query or index.
After base64-decoding this token, I found the record number for the next result at the 121st position in the decoded token.
I had previously built an index of tokens using maxResults=1.
This is crazy, but I have no other explanation for the observed behavior.
It is very useful for the server, because the server does very little work per search. On the other hand, this algorithm forces a lot of requests to paginate the whole list, but the per-second request limit keeps that in check.
All you can do is paginate and skip the empty results. Do not forget about the limit on the number of requests.
Do not try to find errors on your side; this is just how the Google Drive API works.
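In the meantime, a minimal Python sketch of the paginate-and-skip approach with the v2 client might look like this (the creds object, the parent-folder id, and the maxResults value are assumptions):
from googleapiclient.discovery import build

# 'creds' is assumed to be an authorized credentials object you already have.
service = build("drive", "v2", credentials=creds)

query = "'xxxxx' in parents and not title contains 'toto'"
matching = []
page_token = None

while True:
    resp = service.files().list(q=query, maxResults=20, pageToken=page_token).execute()
    # Pages can legitimately come back empty, so just collect whatever is there.
    matching.extend(resp.get("items", []))
    page_token = resp.get("nextPageToken")
    if not page_token:
        break

for f in matching:
    print(f["title"])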
The contains operator works as a prefix matcher at the moment. title contains 'toto' will match "totolong" and "toto", but not "blahtoto".