I want to extract the question/answer pairs from https://archive.org/download/stackexchange, specifically from the Posts.xml file in any of the dumps (I randomly chose the Anime dump because it is fairly small and close to the top). My understanding of how this file is laid out is that there are two PostTypeId values: 1 is a question (which includes the body of the question, the title, and other metadata) and 2 is an answer (which includes the score, the body of the answer, and other metadata).
The data relates easily enough: if we have an entry such as
<row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="69" ViewCount="22384" Body="<p>Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.</p>
<p>The Straw Hats started out from the first half and are now sailing across the second half.</p>
<p>Wouldn't it have been quicker to set sail in the opposite direction from where they started? </p>
" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="<one-piece>" AnswerCount="5" CommentCount="0" FavoriteCount="2" />
The corresponding answer would be:
<row Id="8" PostTypeId="2" ParentId="1" CreationDate="2012-12-11T20:47:52.167" Score="60" Body="<p>No, there is a reason why they can't. </p>
<p>Basically the <a href="http://onepiece.wikia.com/wiki/New_World">New World</a> is beyond the <a href="http://onepiece.wikia.com/wiki/Red_Line">Red Line</a>, but you cannot "walk" on it, or cross it. It's a huge continent, very tall that you cannot go through. You can't cross the <a href="http://onepiece.wikia.com/wiki/Calm_Belt">Calm Belt</a> either, unless you have some form of locomotion such as the Navy or <a href="http://onepiece.wikia.com/wiki/Boa_Hancock">Boa Hancock</a>.</p>
<p>So the only way is to start from one of the Four Seas, then to go the <a href="http://onepiece.wikia.com/wiki/Reverse_Mountain">Reverse Mountain</a> and follow the Grand Line until you reach <em><a href="http://onepiece.wikia.com/wiki/Raftel">Raftel</a></em>, which supposedly is where One Piece is located.</p>
<p><img src="http://i.stack.imgur.com/69IZ0.png" alt="enter image description here"></p>
" OwnerUserId="15" LastEditorUserId="1528" LastEditDate="2013-05-06T19:21:04.703" LastActivityDate="2013-05-06T19:21:04.703" CommentCount="1" />
Inside the first XML snippet, PostTypeId="1" indicates that this row is a question and AcceptedAnswerId="8" gives the Id of the accepted answer. In the second snippet, Id="8" matches the question's AcceptedAnswerId, PostTypeId="2" indicates that this row is an answer, and ParentId is the question's Id.
With that said, how could I easily query this data for the question/answer pairs? Ideally I would like to convert it to a SQLite3 or MySQL database, since I am familiar with those kinds of data structures. If that is not possible (either through the database's own functions or through a scripted wrapper that populates the database), how would I parse this data in Ruby so that I can go through the entire XML document, extracting the title and body of each question and pairing it with the appropriate answer body?
Thanks for your time.
The Stack Exchange Creative Commons Data Dump is just a (sanitized) dump from the Stack Exchange production Microsoft SQL Server database. So, considering that the data came from a SQL database and truly is relational data, you can import it back into one.
The database schemata are described in the Data Dump's README, and you can find some old scripts for importing it into a database on Meta Stack Exchange. Of course, if all you want is a SQL-like relational query interface, you could just use the Stack Exchange Data Explorer.
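If you would rather build the database yourself, here is a minimal sketch in Python of one way to do it (the same approach translates directly to Ruby with Nokogiri or REXML); it streams Posts.xml and loads questions and answers into two SQLite tables. The database filename and table names are my own choice.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect('anime.db')
conn.execute('CREATE TABLE IF NOT EXISTS questions '
             '(id INTEGER PRIMARY KEY, accepted_answer_id INTEGER, title TEXT, body TEXT)')
conn.execute('CREATE TABLE IF NOT EXISTS answers '
             '(id INTEGER PRIMARY KEY, parent_id INTEGER, score INTEGER, body TEXT)')

# iterparse streams the file, so even the larger dumps fit in memory
for _, row in ET.iterparse('Posts.xml'):
    if row.tag != 'row':
        continue
    a = row.attrib
    if a.get('PostTypeId') == '1':    # question
        conn.execute('INSERT INTO questions VALUES (?, ?, ?, ?)',
                     (a['Id'], a.get('AcceptedAnswerId'), a.get('Title'), a.get('Body')))
    elif a.get('PostTypeId') == '2':  # answer
        conn.execute('INSERT INTO answers VALUES (?, ?, ?, ?)',
                     (a['Id'], a.get('ParentId'), a.get('Score'), a.get('Body')))
    row.clear()                       # free the element once processed
conn.commit()
Pairing a question with its accepted answer is then a plain join: SELECT q.title, q.body, a.body FROM questions q JOIN answers a ON a.id = q.accepted_answer_id;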
I'm a bit stuck in all the options the Wikipedia API has.
My goal is to get the word count of a Wikipedia page.
I have the URL of the wiki page.
The search option does return this value:
http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srsearch=camera&srlimit=1
will return
<api>
<query-continue>
<search sroffset="1"/>
</query-continue>
<query>
<searchinfo totalhits="68658"/>
<search>
<p ns="0" title="Camera" snippet="A <span class='searchmatch'>camera</span> is an optical instrument that records images that can be stored directly, transmitted to another location, or both. <b>...</b> " size="43246" wordcount="6348" timestamp="2014-04-29T15:48:07Z"/>
</search>
</query>
</api>
(Scroll a bit to the right and you will find wordcount.)
But this query performs a search and shows the single top result. However, when I search on the Wikipedia page name from the URL, it doesn't always find that record as the first result.
So is there a way to get this wordcount for a Wikipedia page?
No other APIs provide this information, so the kludge with list=search is the only way. If you know the exact title you can get better results by appending &srwhat=nearmatch to the query (it will always return 1 result though). See the docs and try the sandbox to learn more.
Note that word counts are not stored in the database, so the API has to go to Lucene/Elasticsearch for this information, which is not exactly fast; if you need this information en masse you should download a dump instead.
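For example, a minimal sketch in Python using only the standard library ("Camera" is just an illustrative title, and wordcount is the same field shown in the XML above, assuming it is returned for nearmatch queries as it is for regular searches):
import json
import urllib.parse
import urllib.request

def wordcount(title):
    params = urllib.parse.urlencode({
        'format': 'json',
        'action': 'query',
        'list': 'search',
        'srsearch': title,
        'srwhat': 'nearmatch',  # exact-title matching, as suggested above
        'srlimit': 1,
    })
    req = urllib.request.Request('https://en.wikipedia.org/w/api.php?' + params,
                                 headers={'User-Agent': 'wordcount-example/0.1'})
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)['query']['search']
    return results[0]['wordcount'] if results else None

print(wordcount('Camera'))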
My JSON data is as below:
[
{"group":{"key":"Chinese","title":"Chinese","shortTitle":"Chinese","recipesCount":0,"description":"Chinese cuisine is any of several styles originating from regions of China, some of which have become increasingly popular in other parts of the world – from Asia to the Americas, Australia, Western Europe and Southern Africa. The history of Chinese cuisine stretches back for many centuries and produced both change from period to period and variety in what could be called traditional Chinese food, leading Chinese to pride themselves on eating a wide range of foods. Major traditions include Anhui, Cantonese, Fujian, Hunan, Jiangsu, Shandong, Szechuan, and Zhejiang cuisines. ","rank":"","backgroundImage":"images/Chinese/chinese_group_detail.png", "headerImage":"images/Chinese/chinese_group_header.png"},
"key":1000,
"title":"Abalone Egg Custard",
"shortTitle" : "Abalone Egg Custard",
"serves":4,
"perServing":"65kcal / 2.2g fat",
"favorite":false,
"rating": 3 ,
"directions":["Step 1.","Step 2.","Step 3.","Step 4.","Step 5."],
"backgroundImage":"images/Chinese/AbaloneEggCustard.jpg",
"healthytips":["Tip 1","Tip 2","Tip 3"],
"nutritions":["Calories 65kcal","Total Fat 2.2g","Carbs 4g","Cholesterol 58mg","Sodium 311mg","Dietary Fibre 0.3g"],
"ingredients":["1 head Napa or bok choy cabbage","1/4 cup sugar","1/4 teaspoon salt","3 tablespoons white vinegar","3 green onions","1 (3-ounce) package ramen noodles with seasoning pack","1 (6-ounce) package slivered almonds","1 tablespoon sesame seeds","1/2 cup vegetable oil"]}
]
How am I going to persist this in a database? Because at the end of the day I have to read it from the database and be able to parse it using WebAPI.
Persist it as a CLOB data type in your database in the likely event that the length is going to exceed the limits of a varchar.
There are so many potential answers here -- you'll need to provide many more details to get a specific answer.
Database
What database are you using -- is it relational, object, NoSQL? If you come from a NoSQL perspective, saving it as a lump is likely fine. From an RDBMS perspective (like SQL Server), you map all the fields down to a series of rows in a set of related tables, as sketched below. If you're using a relational database, just jamming an unparsed, unvalidated lump of JSON text into the database is the wrong way to go -- why bother hiring a database that provides DRI at all?
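To make that concrete, here is a minimal sketch of the relational mapping in Python, with SQLite standing in for whatever RDBMS you actually use; all table and column names are my own invention:
import json
import sqlite3

conn = sqlite3.connect('recipes.db')
conn.executescript('''
CREATE TABLE IF NOT EXISTS groups (id TEXT PRIMARY KEY, title TEXT, description TEXT);
CREATE TABLE IF NOT EXISTS recipes (id INTEGER PRIMARY KEY, group_id TEXT, title TEXT, serves INTEGER, rating INTEGER);
CREATE TABLE IF NOT EXISTS ingredients (recipe_id INTEGER, position INTEGER, item TEXT);
''')

with open('recipes.json') as f:  # the JSON array from the question
    recipes = json.load(f)

for r in recipes:
    g = r['group']
    conn.execute('INSERT OR IGNORE INTO groups VALUES (?, ?, ?)',
                 (g['key'], g['title'], g['description']))
    conn.execute('INSERT INTO recipes VALUES (?, ?, ?, ?, ?)',
                 (r['key'], g['key'], r['title'], r['serves'], r['rating']))
    # one validated row per ingredient instead of an unparsed lump
    for position, item in enumerate(r['ingredients']):
        conn.execute('INSERT INTO ingredients VALUES (?, ?, ?)',
                     (r['key'], position, item))
conn.commit()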
Data Manipulation Layer
Not included in your question is what type of data manipulation layer you'll use -- it could be LINQ to SQL, straight ADO.NET, a micro-ORM like Dapper, Massive, or PetaPoco, or a full-blown ORM like Entity Framework or NHibernate.
Have you picked one of these or are you looking for guidance on selecting one?
Parsing in WebAPI
Converting from JSON to an object, or an object to JSON, is easy in WebAPI. For JSON specifically, the JSON.NET formatter is included out of the box. You can get started with the ASP.NET Web API documentation on formatters and serialization.
Conceptually, however, it sounds like you're missing part of the magic of WebAPI. With WebAPI you return your object in its native state (or IQueryable if you want OData support). After your function call finishes, the formatters take over and serialize the object into the proper shape based on the client request. This process is called content negotiation. The idea is that your methods are format agnostic and the framework serializes the data into the transport format your client wants (XML, JSON, whatever).
The reverse is true too: the framework deserializes the format provided by the client into a native object.
I was checking one share-trading site's AJAX response, and below is what showed up in the Firebug Response tab of the XHR section. Can anyone explain what format this is and how it is parsed?
<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<566v=Tata Steel Ltdv=TATSTE>
<3199v=Ashram Online.com Ltdv=ASHONL>
<4866v=Kreon Finnancial Services Ltdv=KREFIN>
<552v=Tata Chemicals Ltdv=TATCHE>
<554v=Tata Power Company Ltdv=TATPOW>
<2986v=Tata Metaliks Ltdv=TATMET>
<300v=Tata Sponge Iron Ltdv=TATSPO>
<121v=Tata Coffee Ltdv=TATCOF>
<2295v=Tata Communications Ltdv=TATCOM>
<0v=Time In Milli-Secondsv=0>
I think what we are dealing with here is some proprietary format, likely an Eldritch SGML Horror of some sort.
Banking in general has all sorts of Eldritch horrors running about.
On a related note, this is very much not XML.
Edit:
A quick analysis* indicates that this is a format consisting of a series of statements bracketed by < and >, with the parts of each statement separated by = or v=. = seems to indicate a parameter to a control statement identified by a two-letter code (<ST=tat>), while v= seems to indicate an assignment or coupling of some kind (short for "value"?), or perhaps just a field separator.
<ST appears to be short for "search term"; <TB appears to be short for "(source) table". The meaning of <SI eludes me. It is possible that <TB terminates the metadata section, but it's equally possible that the metadata section has a fixed number of terms.
As nothing refers to the number of fields in each statement in the data section, and they are all of the same length (3 fields), it is likely that the number of fields is fixed, but it might derive from the value of <TB, or even <SI, in some way.
What is abundantly clear, however, is that this data is not intended for consumption by other applications than the one that supplies it.
*Caveat: Without a much larger sample it's impossible to tell if this analysis is valid.
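For what it's worth, if you did have to consume this, a rough Python sketch based on the guesses above (entirely speculative, and it would break on a name that itself contains "v="):
import re

RAW = '''<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<0v=Time In Milli-Secondsv=0>'''

controls = {}
rows = []
for body in re.findall(r'<([^>]*)>', RAW):
    if 'v=' in body:
        rows.append(body.split('v='))         # data statement: fields separated by "v="
    else:
        code, _, value = body.partition('=')  # control statement: code "=" parameter
        controls[code] = value

print(controls)  # {'ST': 'tat', 'SI': '0', 'TB': 'txtSearch'}
print(rows)      # [['560', 'Tata Motors Ltd', 'TATMOT'], ['0', 'Time In Milli-Seconds', '0']]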
It is not a commonly used "web format".
It is probably a proprietary format used by that site and will be parsed by their custom JavaScript.
I want to be able to parse specific content from a website into a MySQL database. For example, from http://allrecipes.com/Recipe/Fluffy-Pancakes-2/Detail.aspx I want to parse into my database (which has a table with columns RecipeName, Ingredients 1-10).
So basically my database will contain the name and all the ingredients for that recipe. There is no need to edit the content, simply parse it in as is (e.g., "3/4 cup milk"), since I am using character columns in my database.
How exactly do I go about doing this? I was looking at pre-built parsers, and it seems tough to find one that's easy to use, since I am fairly new to programming. Of course, I could manually enter the values, but I want to parse them in.
Would it be possible to just parse this content into a file of RecipeName/Ingredient pairs, which I could then load into my database? Or should I parse it directly into the database? I am also unsure how to connect a parser directly to a database, but I might be able to find some information on that online.
Basically, I am looking for help on exactly how to go about doing this, since I am not very well versed in programming and this seems a lot more complicated than it perhaps is.
I am using Java as my main language right now, although I can't say I am very good at it. But I should be able to understand the basic concepts.
Any suggestions on what parser to use or how to do this?
Thanks!
This is how I would do it in PHP. This is almost certainly NOT the most efficient way to do it, nor has it been debugged.
function parseHTML($rawHTML){
    $startPosition = strpos($rawHTML,'<div class="ingredients"'); //Find the position of the beginning of the ingredients list
    $endPosition = strpos($rawHTML,'</div>',$startPosition); //Find the position of the end of the list, searching from the start found above
    $relevantPart = substr($rawHTML,$startPosition,$endPosition-$startPosition); //Isolate the list (note: substr takes a length, not an end position)
    $parsedString = strip_tags($relevantPart); //Strip the HTML tags off of the ingredients list
    return $parsedString;
}
Still to be done: you say you have a MySQL database with 10 separate ingredient columns, and this code outputs everything as one big string. You would have to change the strip_tags($relevantPart) call to strip_tags($relevantPart,"<li>") so that the <li> tags are let through, and then loop through every <li> tag, performing a similar extraction on each. It shouldn't be too hard, but I don't feel comfortable writing it without a functioning PHP server to test on; a rough sketch follows.
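For illustration, here is that loop sketched in Python instead (it makes the same unverified assumption as the PHP above, namely that the page wraps its ingredient list in <div class="ingredients"> with one <li> per ingredient; check that against the actual page source):
import re

def parse_ingredients(raw_html):
    start = raw_html.find('<div class="ingredients"')  # beginning of the ingredients list
    end = raw_html.find('</div>', start)               # end of the list
    section = raw_html[start:end]
    items = re.findall(r'<li[^>]*>(.*?)</li>', section, re.DOTALL)
    # strip any tags left inside each <li> and trim whitespace
    return [re.sub(r'<[^>]+>', '', item).strip() for item in items]
Each element of the returned list (e.g. "3/4 cup milk") can then go into one of your ingredient columns.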
I'm working with a set of lobbying disclosure records. The Secretary of the Senate publishes these records as XML files, which look like this:
<Filing ID="1ED696B6-B096-4591-9181-DA083921CD19" Year="2010" Received="2011-01-01T11:33:29.330" Amount="" Type="LD-203 YEAR-END REPORT" Period="Year-End (July 1 - Dec 31)">
<Registrant xmlns="" RegistrantID="8772" RegistrantName="CERIDIAN CORPORATION" Address="4524 Cheltenham Drive
Bethesda, MD 20814" RegistrantCountry="USA"/>
<Lobbyist xmlns="" LobbyistName="O'CONNELL, JAMES"/>
</Filing>
<Filing ID="179345CF-8D41-4C71-9C19-F41EB88254B5" Year="2010" Received="2011-01-01T13:48:31.543" Amount="" Type="LD-203 YEAR-END AMENDMENT" Period="Year-End (July 1 - Dec 31)">
<Registrant xmlns="" RegistrantID="400447142" RegistrantName="Edward Merlis" Address="8202 Hunting Hill Lane
McLean, VA 22102" RegistrantCountry="USA"/>
<Lobbyist xmlns="" LobbyistName="Merlis, Edward A"/>
<Contributions>
<Contribution xmlns="" Contributor="Merlis, Edward A" ContributionType="FECA" Payee="DeFazio for Congress" Honoree="Cong. Peter DeFazio" Amount="250.0000" ContributionDate="2010-09-05T00:00:00"/>
<Contribution xmlns="" Contributor="Merlis, Edward A" ContributionType="FECA" Payee="Friends of Jim Oberstar" Honoree="Cong. Jim Oberstar" Amount="1000.0000" ContributionDate="2010-09-01T00:00:00"/>
<Contribution xmlns="" Contributor="Merlis, Edward A" ContributionType="FECA" Payee="McCaskill for Missouri 2012" Honoree="Senator Claire McCaskill" Amount="1000.0000" ContributionDate="2010-09-18T00:00:00"/>
<Contribution xmlns="" Contributor="Merlis, Edward A" ContributionType="FECA" Payee="Mesabi Fund" Honoree="Cong. Jim Oberstar" Amount="500.0000" ContributionDate="2010-07-13T00:00:00"/>
</Contributions>
</Filing>
As you can see, some <Filing> tags also contain <Contribution> tags, but others do not.
I see two objects here: contributors (i.e., lobbyists) and contributions (i.e., transactions between a lobbyist and a member of Congress).
I'd like to load these records into a MySQL database. To me, the logical structure would include two tables: one for contributors (with fields for name, ID, address, etc.) and one for contributions (with amount, recipient, etc., and a relational link to the list of contributors).
My question: am I approaching this problem correctly? If so, does this data schema make sense? Finally, how am I to parse the XML to load it into the MySQL tables as I've structured them?
Solved: I'm using a Python SAX parser to process the XML file.
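For reference, a minimal sketch of what such a SAX handler can look like (it assumes the file has a single root element wrapping the <Filing> records; the rows are simplified to a few fields each, so extend them with whatever attributes your schema needs, and the actual MySQL INSERTs would follow):
import xml.sax

class FilingHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.filing_id = None
        self.contributors = []   # rows for the contributors table
        self.contributions = []  # rows for the contributions table

    def startElement(self, name, attrs):
        if name == 'Filing':
            self.filing_id = attrs.get('ID')
        elif name == 'Lobbyist':
            self.contributors.append((self.filing_id, attrs.get('LobbyistName')))
        elif name == 'Contribution':
            self.contributions.append((self.filing_id,
                                       attrs.get('Contributor'),
                                       attrs.get('Payee'),
                                       attrs.get('Honoree'),
                                       attrs.get('Amount'),
                                       attrs.get('ContributionDate')))

handler = FilingHandler()
xml.sax.parse('filings.xml', handler)
# handler.contributors and handler.contributions can now be bulk-inserted into MySQL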
If you are using MySQL version 5.5 you may find the LOAD XML command useful.
That being said, LOAD XML appears to be geared towards loading data into a single table for a given XML file, so it may not work for your specific files.
The traditional approach to these kinds of problems is to use an ETL tool.
Do you already have such a tool (e.g., Informatica or Talend) in your organization?
Another approach is to write a small utility that parses these XMLs and loads the data by creating master-detail relationships in MySQL.