Graphs - find common data - language-agnostic

I've just started to read up on graph theory and data structures.
I'm building an example application which should be able to find the xpath for the most common links. Imagine a Google SERP: my application should be able to find the xpath for all links pointing to a result.
Imagine that these xpaths were found:
/html/body/h2/a
/html/body/p/a
/html/body/p/strong/a
/html/body/p/strong/a
/html/body/p/strong/a
/html/body/div[#class=footer]/span[#id=copyright]/a
From these xpaths, I've thought of a graph like this (I might be completely lost here):
                html
                 |
                body
    h2     -     p      -     div[#class=footer]
    |            |                 |
  a (1)      a - strong       span[#id=copyright]
                   |               |
                 a (3)           a (1)
Is this the best approach to this problem?
What would be the best way (data structure) to store this in memory? The language does not matter. We can see that we have 3 links matching the path html -> body -> p -> strong -> a.
As I said, I'm totally new to this, so please forgive me if I've thought about this completely wrong.
EDIT: I may be looking for the trie data structure?

Don't worry about tries yet. Just construct a tree using a standard graph representation (node = {value, count, parent}), immediately collapsing identical branches and incrementing the counter as you insert each path. Then sort all the leaves by count in descending order and traverse from each leaf upwards to get a path.
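A minimal Python sketch of that idea, under some assumptions: the names Node, build_tree, path_of and most_common are made up for illustration, steps are split naively on "/" (so a predicate containing a slash would break it), and every node where at least one xpath ends is treated as a result, which for the example data means exactly the leaves.

```python
class Node:
    """One xpath step. `count` is how many input paths end exactly here."""

    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent
        self.count = 0
        self.children = {}  # step -> Node, so identical branches collapse


def build_tree(xpaths):
    root = Node("")  # synthetic root sitting above /html
    for xpath in xpaths:
        node = root
        for step in xpath.strip("/").split("/"):
            if step not in node.children:
                node.children[step] = Node(step, node)
            node = node.children[step]
        node.count += 1  # one more link ends at this node
    return root


def path_of(node):
    """Walk parent pointers upwards to rebuild the xpath."""
    steps = []
    while node.parent is not None:
        steps.append(node.value)
        node = node.parent
    return "/" + "/".join(reversed(steps))


def most_common(root):
    """Return (xpath, count) pairs, highest count first."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.count:
            found.append(node)
        stack.extend(node.children.values())
    found.sort(key=lambda n: n.count, reverse=True)
    return [(path_of(n), n.count) for n in found]
```

Feeding it the six xpaths from the question puts /html/body/p/strong/a first with a count of 3. The dictionary of children is what makes this behave like a trie, so the two ideas are not really in conflict.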


Seurat cross-species integration

I am currently working with single-cell data from human and zebrafish, both from brain tissue.
My assignment is to integrate them. These are the steps I have followed so far:
Find human orthologs for the zebrafish genes in BioMart
Keep only the one2one orthologs
Subset the zebrafish Seurat object based on the orthologs and replace the gene names with the human gene names
Create a new object for zebrafish and run normalization and FindVariableFeatures
Then use this object together with my human object for integration
Human object: 20620 features across 2989 samples
Zebrafish object: 6721 features across 6036 samples
features <- SelectIntegrationFeatures(object.list = double.list)
anchors <- FindIntegrationAnchors(object.list = double.list,
                                  anchor.features = features,
                                  normalization.method = "LogNormalize",
                                  nn.method = "rann")
This identifies 2085 anchors.
I used nn.method="rann" because with the default I get this error:
Error: C stack usage 7973252 is too close to the limit
Then I run the integration like this:
ZF_HUMAN.combined <- IntegrateData(anchorset = anchors,
                                   new.assay.name = "integrated")
and the error I am receiving is like this:
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
Finding all pairwise anchors
| | 0 % ~calculating Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 9265 anchors
Filtering anchors
Retained 2085 anchors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=22s
To solve this I tried playing around with the arguments of FindIntegrationAnchors,
e.g. I used l2.norm=F. The only thing that changed was the number of anchors, which decreased.
I am wondering whether using nn.method="rann" in FindIntegrationAnchors is messing things up.
Any help will be appreciated, because I have been struggling with this for a long time and I don't know what else to try.

Extract all words from text field in mysql

I have a table that contains text fields. There are around 20 to 50 sentences in each field, depending on the row. I am building an auto-complete HTML control with HTML and PHP: when I start typing the beginning of a word, I would like the database to return the sentences containing words that start that way (like the navigation pane in Microsoft Office 2007/2010).
I need MySQL to return those words or sentences as separate results, so I can manipulate them further.
Example:
-------------------------------------------------------------------------
| id | title  | content                                                 |
-------------------------------------------------------------------------
| 1  | test 1 | PHP is a very nice language and has nice features.      |
| 2  | test 2 | Spain is a nice country to visit and has nice language. |
| 3  | test 3 | Perl isn't as nice a language as PHP.                   |
I need a MySQL query that returns the following as separate results:
1,"nice language"
1,"nice features"
2,"nice country"
2,"nice language"
3,"nice a language"
Here is my SQL query:
SELECT id, SUBSTR(content, POSITION('nice' IN content), 50)
FROM entries
WHERE MATCH (title, content) AGAINST ('nice' WITH QUERY EXPANSION);
New Answer
The OP's question actually has nothing to do with PHP or JavaScript: it is about doing string manipulation directly within MySQL.
String manipulation isn't really the main focus of a DBMS. When dealing with "words" in a fluid text sense, there's a lot of logic required to determine where the next word boundary is, and you really don't want your database doing this. Plus, any queries written to do this will probably be incredibly difficult to read.
It depends on exactly what you are doing, but a DB-only approach is quite likely to be slower: SQL's string functions are pretty limited, so you end up chaining many function calls.
And for re-usability and best practice, what if you wanted to change your database in the future to, say, MongoDB? You'd need to re-write the whole damned awkward query.
No, my suggestion would be to pull the whole value into PHP with a standard MySQL query and run it through PCRE with a very simple regex. Job done. It's also better to show what you're actually doing in your PHP code, as that is more "intention revealing".
At least 33% of a developer's work is picking the right tool for the job. PHP is the right tool in this example.
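To make the "very simple regex" concrete, here is a minimal sketch of the matching step. It is written in Python purely for illustration; the same pattern should carry over to PHP's preg_match_all. The function name phrases and its extra_words parameter are invented for this example, and the rows are assumed to have already been fetched as (id, content) pairs.

```python
import re


def phrases(rows, prefix, extra_words=1):
    """Return (row_id, phrase) pairs: a word starting with `prefix`
    followed by up to `extra_words` more words.

    `rows` is an iterable of (id, content) pairs already read from the DB.
    """
    pattern = re.compile(
        r"\b" + re.escape(prefix) + r"\w*(?:\s+\w+){1," + str(extra_words) + "}",
        re.IGNORECASE,
    )
    return [
        (row_id, match.group(0))
        for row_id, text in rows
        for match in pattern.finditer(text)
    ]


rows = [
    (1, "PHP is a very nice language and has nice features."),
    (2, "Spain is a nice country to visit and has nice language."),
    (3, "Perl isn't as nice a language as PHP."),
]
for row_id, phrase in phrases(rows, "nice"):
    print(row_id, phrase)
```

With extra_words=1 the third row yields "nice a" rather than "nice a language"; deciding how many following words count as part of the phrase is exactly the kind of word-boundary logic that is easier to tune in application code than in SQL.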
Original Answer
You have included the tags php and javascript, so I'm guessing (although your question needs more clarification on this) that you want this 'autocomplete' running client-side. As a result, you have to get your data from server-side to client-side first.
Twitter Bootstrap has something really cool called Typeahead. This uses JavaScript to do what (I think) you require: the example on that page shows how you can type a country and it'll auto-complete it for you.
How do you get this working? Include the required JavaScript file first, and then write your HTML.
Here's some from the source code of the bootstrap page so you can see how it works:
<input type="text" data-provide="typeahead" data-items="4" data-source='["Alabama","Alaska","Arizona","Arkansas","California"]'>
Can you see how the data-source attribute is the one that gives the typeahead the information you want? You want to connect to MySQL, grab your data, and shove these into the data-source array for the JavaScript to work with, as above.
So, on your page load, you connect to MySQL and you pull all the relevant strings you would like to be "auto-complete-able" from the Database. You then put these as new Data attributes for the typeahead, and that's pretty much it!
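As a rough illustration of that last step, the sketch below builds the input tag from a list of strings. It is in Python only to show the idea (in PHP, json_encode plus htmlspecialchars would play the same roles); typeahead_input is a made-up helper name, and the strings are assumed to have already been read from the database.

```python
import html
import json


def typeahead_input(words, items=4):
    """Render an <input> whose data-source attribute holds `words` as JSON.

    The JSON is HTML-escaped so quotes inside the strings cannot break
    out of the attribute.
    """
    source = html.escape(json.dumps(words), quote=True)
    return (
        '<input type="text" data-provide="typeahead" '
        'data-items="%d" data-source="%s">' % (items, source)
    )


print(typeahead_input(["Alabama", "Alaska", "Arizona"]))
```

Escaping matters here because the text comes from free-form database fields: a stray quote in one sentence would otherwise cut the attribute short.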
--
Edit: There's a fork of twitter bootstrap's typeahead that allows AJAX calls, so you could use this to perform the data retrieval asynchronously (if you can figure it out, I'd recommend this approach).

Traverse a DAG (Directed Acyclic Graph) from specified node to create a tree view

I am putting together a parts database using the method below for directed acyclic graphs.
http://www.codeproject.com/Articles/22824/A-Model-to-Represent-Directed-Acyclic-Graphs-DAG-o
I am able to build my data set using the SQL queries from that page which I have converted to MySQL.
Previously I have used the nested sets model although we found that deletions became a problem.
I am unable to find any information on how to traverse the tree using this model. I simply need to be able to create an HTML tree that shows the descendants of a selected parent node and identifies leaf nodes (I will be using jstree).
I can post the code from the nested sets model if that helps. I don't need any help with the HTML; it is the SQL I am stuck with.
Does anyone have any idea where I can find information on the query I need?
EDIT:
Following on from the comments, I'd like to adapt to something more closely linked to Bill Karwin's closure table model. http://www.slideshare.net/billkarwin/models-for-hierarchical-data
I notice, however, that on slides 49-50, which is where I want to select the descendants of a node, the output doesn't seem to provide enough to draw a simple tree. Previously, with the nested sets model, I was able to get output like the following, which traverses left to right, top to bottom. I'll try to explain.
Item | Depth
1 | 0
2 | 0
3 | 1
6 | 2
7 | 0
9 | 1
This allowed me to draw a tree as the SQL listed the order of descendants in a more manipulatable way. I believe it created "depth" by using a COUNT of subtrees and I will dig out the query if it would be useful here.
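To illustrate why that ordering is convenient: a depth-annotated, pre-ordered listing can be folded into a nested structure with a simple stack. The Python sketch below is only an illustration of that step, not of the SQL; the function name nest and the "id"/"children" keys are arbitrary choices, and it assumes each row's depth is at most one greater than the previous row's.

```python
def nest(rows):
    """Turn pre-ordered (item, depth) rows into nested dicts.

    A node with an empty "children" list is a leaf.
    """
    roots = []
    stack = []  # stack[d] is the most recent node seen at depth d
    for item, depth in rows:
        node = {"id": item, "children": []}
        del stack[depth:]  # drop everything at this depth or deeper
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


rows = [(1, 0), (2, 0), (3, 1), (6, 2), (7, 0), (9, 1)]
tree = nest(rows)
```

For the table above this gives three top-level nodes (1, 2 and 7), with 3 under 2, 6 under 3, and 9 under 7.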
Thanks again for all your help.

Is there any way to implement a linked list with indexed access too?

I need some sort of linked list structure, but it would be great if it had indexed access too.
Is there any way to accomplish that?
EDIT: I'm writing in C, but it may be for any language.
One method of achieving your goal is to implement a randomized or deterministic skip list. On the bottom level you have your linked list with your items.
In order to get to elements by index, you'll need to add information to the inner nodes: how many nodes on the lowest level lie between this node and the next node on the same level. This information can be added and maintained in O(log n).
The complexity of this solution is:
Add, Remove and Go to index all work in O(log n) (expected time, for the randomized variant).
The downside of this solution is that it is much more difficult to implement than a regular linked list. With a regular linked list you get Add and Remove in O(1), and Go to index in O(n).
You can probably use a tree for what you are aiming at. Make a binary tree that maintains the weight of each node (where the weight is equal to the number of nodes in the subtree rooted at that node, including itself). If you have a balancing scheme available for the tree, then insertions are still O(log n), since you only need to add one to the weights of the ancestor nodes. Getting a node by index is also O(log n), since you need only look at the weights of the nodes on the path from the root to your desired node and of their children.
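A minimal Python sketch of such a weight-maintaining tree follows. Note the swapped-in choice: it uses a treap (random priorities) as the balancing scheme, which the answer above does not prescribe, so the O(log n) bounds hold in expectation. All names (IndexedList, _split, _merge and so on) are invented for this example.

```python
import random


class _Node:
    def __init__(self, value):
        self.value = value
        self.priority = random.random()  # random heap priority keeps the tree balanced
        self.size = 1                    # the "weight": nodes in this subtree, itself included
        self.left = None
        self.right = None


def _size(node):
    return node.size if node else 0


def _update(node):
    node.size = 1 + _size(node.left) + _size(node.right)
    return node


def _split(node, k):
    """Split a tree into (its first k elements, the remaining elements)."""
    if node is None:
        return None, None
    if _size(node.left) >= k:
        left, node.left = _split(node.left, k)
        return left, _update(node)
    node.right, right = _split(node.right, k - _size(node.left) - 1)
    return _update(node), right


def _merge(a, b):
    """Concatenate two trees, every element of `a` before every element of `b`."""
    if a is None:
        return b
    if b is None:
        return a
    if a.priority > b.priority:
        a.right = _merge(a.right, b)
        return _update(a)
    b.left = _merge(a, b.left)
    return _update(b)


class IndexedList:
    def __init__(self):
        self._root = None

    def __len__(self):
        return _size(self._root)

    def insert(self, index, value):
        left, right = _split(self._root, index)
        self._root = _merge(_merge(left, _Node(value)), right)

    def pop(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        left, rest = _split(self._root, index)
        target, right = _split(rest, 1)
        self._root = _merge(left, right)
        return target.value

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        node = self._root
        while True:
            left = _size(node.left)
            if index < left:
                node = node.left          # target is in the left subtree
            elif index == left:
                return node.value
            else:
                index -= left + 1         # skip the left subtree and this node
                node = node.right
```

The lookup in __getitem__ is the part the answer describes: at each node the left child's weight tells you whether the wanted index lies to the left, at this node, or to the right.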
To get array-like indexing in languages such as C++ or Python, you overload the indexing operator (operator[] in C++, __getitem__ in Python) on the class that implements the linked list. Java has no operator overloading, so there you would expose a get(index) method instead. Either way the implementation is O(n). Since operator overloading is not possible in C either, you would write a function that takes the linked list and a position and returns the corresponding object.
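A minimal Python sketch of that approach, assuming a singly linked list; the class name LinkedList and its push_front method are illustrative only. Indexing simply walks from the head, which is where the O(n) cost comes from.

```python
class LinkedList:
    class _Node:
        __slots__ = ("value", "next")

        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def __init__(self, values=()):
        self._head = None
        self._size = 0
        # Push in reverse so the list ends up in the given order.
        for value in reversed(list(values)):
            self.push_front(value)

    def push_front(self, value):
        """O(1) insertion at the head."""
        self._head = self._Node(value, self._head)
        self._size += 1

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        """O(n) indexed access: follow `index` links from the head."""
        if index < 0:
            index += self._size  # allow negative indices like built-in lists
        if not 0 <= index < self._size:
            raise IndexError("index out of range")
        node = self._head
        for _ in range(index):
            node = node.next
        return node.value
```

With this in place, lst[2] works on a LinkedList exactly as it would on a built-in list, just more slowly for large indices.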
If faster indexed access is required, you would have to use a different data structure, such as the tree suggested by jprete or a dynamic array (which grows automatically as new elements are added to it). A quick example would be std::vector in the C++ standard library.
In SQL Server, the rows of a clustered index are arranged like so:
    .
   / \
  /\ /\
 *-*-*-*
The linked list is in the leaves (*-*-*). The linked list is ordered, allowing fast directional scanning, and the tree serves as a "road map" into the linked list. So you would need a key-value pair for your items, and then a data structure that encapsulates the tree and the linked list.
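So your data structure might look something like this (kv_pair and value_type stand for your own key-value and key types):
struct ll_node
{
    kv_pair current;
    struct ll_node *next;
};
struct tree_node
{
    value_type value;   /* key used to route the search */
    short isLeaf;       /* non-zero when the children are leaves */
    union
    {
        struct tree_node *left_child;
        kv_pair *left_leaf;
    };
    union
    {
        struct tree_node *right_child;
        kv_pair *right_leaf;
    };
};
struct indexed_ll
{
    struct tree_node *tree_root;
    struct ll_node *linked_list_tail;
};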

How can I extract the data out of a typical html day/time schedule?

I'm trying to write a parser to get the data out of a typical html table day/time schedule (like this).
I'd like to give this parser a page and a table class/id, and have it return a list of events, along with days & times they occur. It should take into account rowspans and colspans, so for the linked example, it would return
{:event => "Music With Paul Ray", :times => [T 12:00am - 3:00am, F 12:00am - 3:00am]}, etc.
I've sort of figured out a half-executed messy approach using ruby, and am wondering how you might tackle such a problem?
The best thing to do here is to use an HTML parser. With an HTML parser you can look at the table rows programmatically, without having to resort to fragile regular expressions and doing the parsing yourself.
Then you can run some logic along the lines of (this is not runnable code, just a sketch that you should be able to see the idea from):
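for i, row in enumerate(table):
    if i == 0:
        continue                  # skip row 1, which holds the day names
    starttime = row[0]            # first column holds the time slot
    for cell in row[1:]:
        event = cell.text
        # a cell spanning `rowspan` rows ends where the row after the span starts
        endtime = table[i + cell.rowspan][0]
        print(event, starttime, endtime)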
This is what the program will need to do:
Read the tags in (detect attributes and open/close tags)
Build an internal representation of the table (how will you handle malformed tables?)
Calculate the day, start time, and end time of each event
Merge repeated events into an event series
That's a lot of components! You'll probably need to ask a more specific question.
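As a starting point for the "internal representation" and "merge" steps, here is a hedged Python sketch that uses only the standard library's html.parser. The names TableGrid and events are invented for this example. It assumes the fragment contains a single table with no nested tables, that row 0 holds the day names, and that column 0 holds the start times; a real schedule page may well break those assumptions.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Expands rowspan/colspan so every grid slot maps to its cell's text."""

    def __init__(self):
        super().__init__()
        self.cells = {}     # (row, col) -> text, one entry per slot a cell covers
        self._row = -1
        self._slots = None  # slots covered by the cell currently open
        self._text = []

    def _finish_cell(self):
        if self._slots is not None:
            text = " ".join("".join(self._text).split())
            for slot in self._slots:
                self.cells[slot] = text
            self._slots = None

    def handle_starttag(self, tag, attrs):
        if tag in ("tr", "td", "th"):
            self._finish_cell()  # tolerate a missing </td> in sloppy markup
        if tag == "tr":
            self._row += 1
        elif tag in ("td", "th") and self._row >= 0:
            attrs = dict(attrs)
            rowspan = int(attrs.get("rowspan") or 1)
            colspan = int(attrs.get("colspan") or 1)
            col = 0
            while (self._row, col) in self.cells:  # skip slots taken by earlier spans
                col += 1
            self._slots = [(self._row + r, col + c)
                           for r in range(rowspan) for c in range(colspan)]
            for slot in self._slots:
                self.cells[slot] = ""              # reserve now, fill when the cell closes
            self._text = []

    def handle_data(self, data):
        if self._slots is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._finish_cell()


def events(cells):
    """Return {event name: [(day, start, end), ...]}; end is None past the last row."""
    n_rows = 1 + max(r for r, _ in cells)
    n_cols = 1 + max(c for _, c in cells)
    found = {}
    for c in range(1, n_cols):
        r = 1
        while r < n_rows:
            name = cells.get((r, c), "")
            end = r + 1
            while end < n_rows and cells.get((end, c), "") == name:
                end += 1  # same text in the next slot: still the same event
            if name:
                found.setdefault(name, []).append(
                    (cells.get((0, c), ""), cells.get((r, 0), ""), cells.get((end, 0))))
            r = end
    return found
```

Usage would be along the lines of creating a TableGrid, calling feed() with the table's HTML, then passing its cells to events(). A lenient parser such as BeautifulSoup could replace the TableGrid class for badly malformed pages, while the grid-filling logic would stay the same.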
Use http://www.crummy.com/software/BeautifulSoup/ and that task should be a breeze.
As said, using regexes on HTML is generally a bad idea, you should use a good parser.
For validating XHTML pages, you can use a simple XML parser which is available in most languages. Alas, in your case, the given page doesn't validate (W3C's markup validation service reports 230 Errors, 7 warning(s)!)
For generic, possibly malformed HTML, there are libraries to handle that (kigurai recommends BeautifulSoup for Python, I know also TagSoup for Java, there are others).