Creating MySQL tables from XML data

I'm working with a set of lobbying disclosure records. The Secretary of the Senate publishes these records as XML files, which look like this:
<Filing ID="1ED696B6-B096-4591-9181-DA083921CD19" Year="2010" Received="2011-01-01T11:33:29.330" Amount="" Type="LD-203 YEAR-END REPORT" Period="Year-End (July 1 - Dec 31)">
<Registrant xmlns="" RegistrantID="8772" RegistrantName="CERIDIAN CORPORATION" Address="4524 Cheltenham Drive
Bethesda, MD 20814" RegistrantCountry="USA"/>
<Lobbyist xmlns="" LobbyistName="O'CONNELL, JAMES"/>
</Filing>
<Filing ID="179345CF-8D41-4C71-9C19-F41EB88254B5" Year="2010" Received="2011-01-01T13:48:31.543" Amount="" Type="LD-203 YEAR-END AMENDMENT" Period="Year-End (July 1 - Dec 31)">
<Registrant xmlns="" RegistrantID="400447142" RegistrantName="Edward Merlis" Address="8202 Hunting Hill Lane
McLean, VA 22102" RegistrantCountry="USA"/>
<Lobbyist xmlns="" LobbyistName="Merlis, Edward A"/>
<Contributions>
<Contribution xmlns="" Contributor="Merlis, Edward A" ContributionType="FECA" Payee="DeFazio for Congress" Honoree="Cong. Peter DeFazio" Amount="250.0000" ContributionDate="2010-09-05T00:00:00"/>
<Contribution xmlns="" Contributor="Merlis, Edward A" ContributionType="FECA" Payee="Friends of Jim Oberstar" Honoree="Cong. Jim Oberstar" Amount="1000.0000" ContributionDate="2010-09-01T00:00:00"/>
<Contribution xmlns="" Contributor="Merlis, Edward A" ContributionType="FECA" Payee="McCaskill for Missouri 2012" Honoree="Senator Claire McCaskill" Amount="1000.0000" ContributionDate="2010-09-18T00:00:00"/>
<Contribution xmlns="" Contributor="Merlis, Edward A" ContributionType="FECA" Payee="Mesabi Fund" Honoree="Cong. Jim Oberstar" Amount="500.0000" ContributionDate="2010-07-13T00:00:00"/>
</Contributions>
</Filing>
As you can see, some <Filing> tags also contain <Contribution> tags, but others do not.
I see two objects here: contributors (i.e., lobbyists) and contributions (i.e., a transaction between a lobbyist and a member of Congress).
I'd like to load these records into a MySQL database. To me, the logical structure would include two tables: one for contributors (with fields for name, ID, address, etc.) and one for contributions (with amount, recipient, etc., and a relational link to the list of contributors).
My question: am I approaching this problem correctly? If so, does this data schema make sense? Finally, how should I parse the XML to load it into the MySQL tables as I've structured them?

Solved: I'm using a Python SAX parser to process the XML file.
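For anyone landing here later, a minimal sketch of what such a SAX handler might look like. The attribute names come from the XML above; the file name and the way rows are collected are placeholders, not the exact code I used.
import xml.sax

class FilingHandler(xml.sax.ContentHandler):
    # Collect one row per Registrant (contributor) and one row per Contribution.
    def __init__(self):
        super().__init__()
        self.current_filing = None
        self.contributors = []    # rows for the contributors table
        self.contributions = []   # rows for the contributions table

    def startElement(self, name, attrs):
        if name == "Filing":
            self.current_filing = attrs.get("ID")
        elif name == "Registrant":
            self.contributors.append({
                "filing_id": self.current_filing,
                "registrant_id": attrs.get("RegistrantID"),
                "name": attrs.get("RegistrantName"),
                "address": attrs.get("Address"),
            })
        elif name == "Contribution":
            self.contributions.append({
                "filing_id": self.current_filing,
                "contributor": attrs.get("Contributor"),
                "payee": attrs.get("Payee"),
                "honoree": attrs.get("Honoree"),
                "amount": attrs.get("Amount"),
                "date": attrs.get("ContributionDate"),
            })

handler = FilingHandler()
xml.sax.parse("filings.xml", handler)  # hypothetical file name
Each collected row can then be inserted into the two tables described above.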

If you are using MySQL version 5.5, you may find the LOAD XML command useful.
That being said, LOAD XML appears to be geared towards loading data into a single table for a given XML file, so it may not work for your specific files.
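For a single flat table, LOAD XML maps each matching element's attributes to columns of the same name. A rough sketch of running it from Python (the connector, table and file names are assumptions, and local_infile has to be enabled on both client and server):
import mysql.connector

conn = mysql.connector.connect(user="me", password="secret",
                               database="lobbying", allow_local_infile=True)
cur = conn.cursor()
cur.execute("""
    LOAD XML LOCAL INFILE 'filings.xml'
    INTO TABLE filings
    ROWS IDENTIFIED BY '<Filing>'
""")
conn.commit()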

The traditional approach to this kind of problem is to use an ETL tool.
Do you already have such a tool (e.g. Informatica or Talend) in your organization?
Another approach is to write a small utility that parses these XML files and loads the data into MySQL as master-detail relationships, as sketched below.
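A rough sketch of such a utility (ElementTree plus MySQL Connector/Python; the two tables, their columns and the file name are assumptions, not a fixed schema):
import xml.etree.ElementTree as ET
import mysql.connector

conn = mysql.connector.connect(user="me", password="secret", database="lobbying")
cur = conn.cursor()

for filing in ET.parse("filings.xml").getroot().iter("Filing"):
    reg = filing.find("Registrant")
    if reg is None:
        continue
    cur.execute(
        "INSERT INTO contributors (registrant_id, name, address) VALUES (%s, %s, %s)",
        (reg.get("RegistrantID"), reg.get("RegistrantName"), reg.get("Address")),
    )
    contributor_pk = cur.lastrowid  # master row's primary key
    for c in filing.iter("Contribution"):
        cur.execute(
            "INSERT INTO contributions (contributor_id, payee, honoree, amount,"
            " contribution_date) VALUES (%s, %s, %s, %s, %s)",
            (contributor_pk, c.get("Payee"), c.get("Honoree"),
             c.get("Amount"), c.get("ContributionDate")),
        )

conn.commit()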

Related

Read a CSV file in prolog and format the data into predicates of the form: prefers(employer 1, [Student1, Student2, Student3])

I have been trying to figure this out for days now. I have two CSV files: one with the names of employers and the students they prefer, in order of preference; the other with the names of students and the employers they wish to work for, again in order of preference. The files look like this:
Thales, Melissa, Craig, Luke
US Post, Craig, Luke, Melissa
IBM, Luke, Melissa, Craig
and
Melissa, Thales, US Post, IBM
Craig, IBM, Thales, US Post
Luke, US Post, IBM, Thale
I need to read these files line by line and populate a Prolog database with predicates like:
prefers(Thales, [Melissa, Craig, Luke]).
prefers(Craig, [IBM, Thales, US Post]).
student(Melissa).
employer(IBM).
I know how to open the file, read it, and create a stream, but I have no clue how to do the rest. The solution is most likely DCG-based.
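No Prolog here, but as a sanity check on the target format, a small Python sketch (file names are made up) that reads the employer CSV and writes prefers/2 facts a Prolog program could later consult:
import csv

def atom(s):
    # quote each value so names with spaces ("US Post") stay a single atom
    return "'%s'" % s.strip().replace("'", "\\'")

with open("employers.csv") as src, open("prefs.pl", "w") as out:
    for row in csv.reader(src):
        head, choices = row[0], row[1:]
        out.write("prefers(%s, [%s]).\n" % (atom(head), ", ".join(atom(c) for c in choices)))
The same loop over the student file would produce the student-side prefers/2 facts.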

Understanding openaddresses data format

I have downloaded us-west geolocation data (postal addresses) from openaddresses.io. Some of the addresses in the datasets are not complete, i.e., some of them don't have fields like zip_code. Is there a way to retrieve this information, or is the data simply incomplete?
I have tried searching the other files hoping to find related info, but the complete dataset doesn't contain it. The City of Mesa, AZ has multiple zip codes, so it is hard to assign one to an address. Is there any way to address this problem?
This is what the data looks like (City of Mesa, AZ):
LON,LAT,NUMBER,STREET,UNIT,CITY,DISTRICT,REGION,POSTCODE,ID,HASH
-111.8747353,33.456605,790,N DOBSON RD,,SRPMIC,,,,,dc0c53196298eb8d
-111.8886227,33.4295194,2630,W RIO SALADO PKWY,,MESA,,,,,c38b700309e1e9ce
-111.8867018,33.4290795,2401,E RIO SALADO PKWY,,TEMPE,,,,,9b912eb2b1300a27
-111.8832045,33.4232903,700,S EVERGREEN RD,,TEMPE,,,,,3435b99ab3f4f828
-111.8761202,33.4296416,2100,W RIO SALADO PKWY,,MESA,,,,,b74349c833f7ee18
-111.8775844,33.4347782,1102,N RIVERVIEW,,MESA,,,,,17d0cf1542c66083
Short answer: the data is incomplete.
The data in OpenAddresses.io is only as complete as the data sources it pulls from. OpenAddresses is just an aggregation of publicly available datasets. There's no real consistency between government agencies that make their data available. As a result, other sections of the OpenAddresses dataset might have city names or zip codes, but there's often something missing.
If you're looking to fill in the missing data, take a look at how projects like Pelias use multiple data sources to augment missing data.
Personally, I always end up going back to OpenStreetMap (OSM). One could argue that OpenAddresses is better quality because it comes from official sources and doesn't try to fill in data using approximations, but the large gaps of missing data make it far less useful, at least on its own.
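If you want to quantify how much is missing before deciding whether OpenAddresses alone is enough, a quick pandas check (the file path is made up) over a CSV like the one above:
import pandas as pd

df = pd.read_csv("az/city_of_mesa.csv")
print(df["POSTCODE"].isna().mean())                       # fraction of rows missing a zip code
print(df[df["POSTCODE"].isna()]["CITY"].value_counts())   # which cities are affected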

Extract Q&A pairs from the XML stackexchange dumps

I want to extract question/answer pairs from https://archive.org/download/stackexchange, specifically from the Posts.xml file of any of the dumps (I randomly chose the Anime dump as it was fairly small and close to the top). My understanding of how this file is laid out is that there are two PostTypeId values: 1 is a question (which includes the body of the question, the title, and other metadata) and 2 is an answer (which includes the score, the body of the answer, and other metadata).
The data relates easily enough: if we have an entry such as
<row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="69" ViewCount="22384" Body="<p>Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.</p>
<p>The Straw Hats started out from the first half and are now sailing across the second half.</p>
<p>Wouldn't it have been quicker to set sail in the opposite direction from where they started? </p>
" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="<one-piece>" AnswerCount="5" CommentCount="0" FavoriteCount="2" />
The corresponding answer would be:
<row Id="8" PostTypeId="2" ParentId="1" CreationDate="2012-12-11T20:47:52.167" Score="60" Body="<p>No, there is a reason why they can't. </p>
<p>Basically the <a href="http://onepiece.wikia.com/wiki/New_World">New World</a> is beyond the <a href="http://onepiece.wikia.com/wiki/Red_Line">Red Line</a>, but you cannot "walk" on it, or cross it. It's a huge continent, very tall that you cannot go through. You can't cross the <a href="http://onepiece.wikia.com/wiki/Calm_Belt">Calm Belt</a> either, unless you have some form of locomotion such as the Navy or <a href="http://onepiece.wikia.com/wiki/Boa_Hancock">Boa Hancock</a>.</p>
<p>So the only way is to start from one of the Four Seas, then to go the <a href="http://onepiece.wikia.com/wiki/Reverse_Mountain">Reverse Mountain</a> and follow the Grand Line until you reach <em><a href="http://onepiece.wikia.com/wiki/Raftel">Raftel</a></em>, which supposedly is where One Piece is located.</p>
<p><img src="http://i.stack.imgur.com/69IZ0.png" alt="enter image description here"></p>
" OwnerUserId="15" LastEditorUserId="1528" LastEditDate="2013-05-06T19:21:04.703" LastActivityDate="2013-05-06T19:21:04.703" CommentCount="1" />
Inside the first XML snippet, PostTypeId="1" indicates that this row is a question and AcceptedAnswerId="8" gives the Id of the accepted answer. In the second snippet, Id="8" matches the AcceptedAnswerId from the question, PostTypeId="2" indicates that this is an answer, and ParentId is the question's Id.
With that said, how can I easily query this data for question/answer pairs? Ideally I would convert it into a SQLite3 or MySQL database, since I am familiar with those kinds of data structures. If that is not possible (either through the database's own functions or through a scripted wrapper that populates the database), how would I parse this data in Ruby so that I can go through the entire XML document, extract the title and body of each question, and pair it with the appropriate answer body?
Thanks for your time.
The Stack Exchange Creative Commons Data Dump is just a (sanitized) dump from the Stack Exchange production Microsoft SQL Server database. So, considering that the data came from a SQL database and truly is relational data, you can import it back into one.
The database schemata are described in the Data Dump's README, and you can find some old scripts for importing it into a database on Meta Stack Exchange. Of course, if all you want is SQL-like relational query interface, you could just use the Stack Exchange Data Explorer.
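If you do want to script the import yourself, a rough sketch of the idea (in Python rather than Ruby; file and table names are made up) that streams Posts.xml into SQLite and joins questions to their accepted answers:
import sqlite3
import xml.etree.ElementTree as ET

db = sqlite3.connect("anime.db")
db.execute("CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, type INTEGER,"
           " accepted_id INTEGER, title TEXT, body TEXT)")

for _, row in ET.iterparse("Posts.xml"):
    if row.tag == "row":
        db.execute("INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
                   (int(row.get("Id")), int(row.get("PostTypeId")),
                    row.get("AcceptedAnswerId"), row.get("Title"), row.get("Body")))
        row.clear()  # keep memory use flat on big dumps
db.commit()

pairs = db.execute(
    "SELECT q.title, q.body, a.body FROM posts q"
    " JOIN posts a ON a.id = q.accepted_id"
    " WHERE q.type = 1 AND a.type = 2").fetchall()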

Using the nltk to recognise dates as named entities?

I'm trying to use the NLTK Named Entity Tagger to identify various named entities. In the book Natural Language Processing with Python they provide a list of commonly used named entities (Table 7.4, if anyone is curious), which includes DATE (e.g. June, 2008-06-29) and TIME (e.g. two fifty a m, 1:30 p.m.). So I got the impression that this could be done with NLTK's named entity tagger.
However, when I've run the tagger, it doesn't seem to pick up dates or times at all, as it does people or organizations. Does the NLTK named entity tagger not handle these date/time cases, or does it only pick up a specific date/time format? If it doesn't handle these cases, does anybody know of a system that does? Or is creating my own the only solution?
Thanks!
You should check out NLTK's contrib repository, which contains a module called timex.py, or download it here:
https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
From the first line of the module:
# Code for tagging temporal expressions in text
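If pulling in nltk_contrib is inconvenient, here is the same idea in miniature: a couple of regexes that wrap temporal expressions in TIMEX tags. The patterns below are only a toy subset of what timex.py covers.
import re

MONTHS = r"(?:January|February|March|April|May|June|July|August|September|October|November|December)"
PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",                        # 2008-06-29
    r"\d{1,2}:\d{2}\s*(?:a\.?m\.?|p\.?m\.?)",    # 1:30 p.m.
    MONTHS + r"(?:\s+\d{1,2})?(?:,\s*\d{4})?",   # June / June 29 / June 29, 2008
]

def tag_times(text):
    for pat in PATTERNS:
        text = re.sub(pat, lambda m: "<TIMEX>%s</TIMEX>" % m.group(0), text, flags=re.IGNORECASE)
    return text

print(tag_times("The meeting on 2008-06-29 starts at 1:30 p.m."))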

How to search for a person's name in a text? (heuristic)

I have a huge list of people's full names that I must search for in a huge text.
Only part of a name may appear in the text, and it may be misspelled, mistyped or abbreviated. The text is not tokenized, so I don't know where a person's name starts, and I don't know whether a given name will appear in the text at all.
Example:
I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:
...The candidate Barack Obama was elected the president of the United States... (incomplete)
...The candidate Barack Hussein was elected the president of the United States... (incomplete)
...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
...The candidate Barack OVama was elected the president of the United States... (misstyped, B is next to V)
...The candidate John McCain lost the election... (no occurrence of Obama's name)
Certainly there isn't a deterministic solution for this, but...
What is a good heuristic for this kind of search?
If you had to, how would you do it?
You said it's about 200 pages.
Divide it into 200 one-page PDFs.
Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.
Split everything on spaces, removing special characters (commas, periods, etc.). Then use something like Soundex to handle misspellings. Or you could go with something like Lucene if you need to search a lot of documents.
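Something along these lines (a simplified Soundex that skips the H/W rule, plus a naive token scan) shows the idea and already catches the "ObaNa"/"OVama" cases:
import re

def soundex(word):
    codes = {}
    for chars, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in chars:
            codes[ch] = digit
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out += digit
        prev = digit
    return (out + "000")[:4]

def name_hits(name, text):
    targets = {soundex(part) for part in name.split()}
    tokens = re.findall(r"[A-Za-z]+", text)  # split on spaces, drop punctuation
    return [t for t in tokens if soundex(t) in targets]

print(name_hits("Barack Hussein Obama",
                "The candidate Barack OVama was elected president"))  # ['Barack', 'OVama']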
What you want is a Natural Language Processing (NLP) library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns, it will be easy; if there are a decent number of other proper nouns mixed in, it will be more difficult. If you are writing in Java look at OpenNLP; in C#, SharpNLP. After extracting all the proper nouns you could probably use WordNet to remove most non-name proper nouns. You may also be able to use WordNet to identify subparts of names like "John" and then search the neighboring tokens to pick up the other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features you can take advantage of to help narrow the problem.
Using an NLP solution is the only really robust technique I have seen for similar problems. You may still have issues, since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non-names.
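In Python, the equivalent first pass can be done with NLTK (a stand-in for the OpenNLP/SharpNLP suggestion above): tag the text and keep the proper nouns as name candidates.
import nltk  # may need nltk.download("punkt") and nltk.download("averaged_perceptron_tagger") first

text = "The candidate Barack Obama was elected the president of the United States"
tagged = nltk.pos_tag(nltk.word_tokenize(text))
proper_nouns = [word for word, tag in tagged if tag in ("NNP", "NNPS")]
print(proper_nouns)  # e.g. ['Barack', 'Obama', 'United', 'States']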
At first blush I'd go for an indexing server: Lucene, FAST, or Microsoft Indexing Server.
I would use C# and LINQ. I'd tokenize all the words on space and then use LINQ to sort the text (and possibly use the Distinct() function) to isolate all the text that I'm interested in. When manipulating the text I'd keep track of the indexes (which you can do with LINQ) so that I could relocate the text in the original document - if that's a requirement.
The best way I can think of would be to define grammars in Python's NLTK. However, it can get quite complicated for what you want.
Personally, I'd go for regular expressions, generating a list of name permutations with some programming, as sketched below.
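A sketch of that permutation idea (the names, helper and exact patterns are illustrative): build loose regexes for ordered subsets of the full name, letting each part appear in full or as an initial.
import itertools
import re

def name_patterns(full_name):
    parts = full_name.split()
    patterns = []
    for r in range(2, len(parts) + 1):
        for combo in itertools.combinations(parts, r):
            # each part may appear in full or abbreviated, e.g. "Hussein" -> "H."
            alts = ["(?:%s|%s\\.)" % (re.escape(p), re.escape(p[0])) for p in combo]
            patterns.append(re.compile(r"\b" + r"\s+".join(alts) + r"(?!\w)", re.IGNORECASE))
    return patterns

patterns = name_patterns("Barack Hussein Obama")
text = "The candidate Barack H. O. was elected the president of the United States"
print(any(p.search(text) for p in patterns))  # True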
Both SQL Server and Oracle have built-in SOUNDEX functions.
Additionally, SQL Server has a built-in DIFFERENCE function that can be used.
Plain old regular expression scripting will do the job. Use Ruby, it's quite fast: read lines and match words.
Cheers