What can I do with an inconsistent column delimited text file? - csv

I have a text file that looks something like...
firstname:middle:lastname
firstname:middle:lastname
firstname:lastname
firstname:middle:lastname
firstname:lastname
I would like to be able to eventually use this information in a MySQL database, but since the number of columns is not consistent I am not sure what to do. Is there any way to resolve this?

If the data you have is only the above variations, then you can make the assumptions:
First part is the firstname
Last part is the lastname
Therefore, if using PHP for example, you could use explode to separate the data on the delimiter, which in this case is :.
When looping through each row, just assume the last part is the lastname, the first part is the firstname, and, if present, the middle part is the middlename.
You can use count() to find out how many parts are in the specific row you are reading inside the loop. This tells you which part is the last one and whether a middle name is present.
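A minimal sketch of that approach in PHP (the file name names.txt is just a placeholder, and the MySQL insert is left as a comment):

<?php
// Read the colon-delimited file line by line (names.txt is a placeholder name).
$lines = file('names.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($lines as $line) {
    $parts  = explode(':', $line);
    $first  = $parts[0];                               // first part is always the first name
    $last   = $parts[count($parts) - 1];               // last part is always the last name
    $middle = count($parts) === 3 ? $parts[1] : null;  // middle name only when a row has 3 parts
    // ...insert $first, $middle, $last into MySQL here, e.g. with a prepared statement.
}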

If the file really is this simple ... the solution is trivial:
firstname:middle:lastname
firstname:lastname
if(there are only two columns) { that means we have first and last name }
else { we have first, middle and last name }
If there are more columns, you could perhaps still resolve the data to the proper columns if you manage to build a priority list (i.e. the order in which fields can be missing, for example 'last name > first name > middle name') and/or combine that with data-type matching (string/int/double/date). In any case, you need to gather all your domain knowledge and see whether that suffices.

Related

Regular expression to pick a row in an html table containing desired text

Sorry, but uhrm, I'd like to use regexp (actually I'd use something else but I want to do the task within a Matlab function) to pick a single row containing desired keywords within an html table.
I am using Matlab calling function regexpi (case-insensitive version of regexp), which is akin to PHP regex from what I can tell.
Ok, here's a snippet from such an html table to parse:
<tr><td>blu</td><td>value</td></tr><tr><td>findme</td><td>value</td></tr><tr><td>ble</td><td>value</td></tr>
The desired row to pick contains the word "findme".
(added:) The content of the other cells and tags in the table could be anything (the other cell values are just dummies) - the important part is the presence of "findme" and that a single row (not more) is caught (or all rows containing "findme", but such behaviour is not expected). Any paired name/value table in a Wikipedia page is a good example.
I tinkered with https://regex101.com/ using whatever I could dig up in the Matlab documentation (lookahead/lookbehind, combinations of :, > and ?), but have failed to identify a pattern that picks just the right row (or all those that contain the keyword "findme"). The following pattern, for instance, will pick the text but not the entire row: <tr[^>]*>[^>]*.*?(findme).*?<\/td .
Pattern <tr[^>]*>(.*?findme.*?)<\/tr[^>]*> picks the row but is too greedy and picks preceding rows.
Note that the original task I had set out was to capture entire tables and then parse these, but the Matlab regexp-powered function I found for the task had trouble with nested tables (or I had trouble implementing it for the task).
The question is how to return a row containing desired keywords from an html table, programmatically, within a matlab function (without calling an external program)? Bonus question is how to solve the nested table issue, but maybe that's another question.
I suggest you split up the string with strsplit and use contains for the filtering, which is a lot more readable and maintainable than a regex pattern:
htmlString = ['<tr><td>blu</td><td>value</td></tr><tr><td><a ',...
'href="bla">findme</a></td><td>value</td></tr><tr><td><a ',...
'href="ble">ble</a></td><td>value</td></tr>'];
keyword = 'findme';
splitStrings = strsplit(htmlString,'<tr>');
desiredRow = ['<tr>' splitStrings{contains(splitStrings,keyword)}]
The output is:
<tr><td><a href="bla">findme</a></td><td>value</td></tr>
Alternatively you may also combine extractBetween and contains:
allRows = extractBetween(htmlString,'<tr>','</tr>');
desiredRow = ['<tr>' allRows{contains(allRows,keyword)} '</tr>']
If you must use regex:
regexp(htmlString,['<tr><td>[^>]+>' keyword '.*?<\/tr>'],'match')
Try this
%<td>(.*?)%sg
https://regex101.com/r/0Xq0mO/1

Reshape the dataset into a more relational format (transpose SOME rows and assign them to a data subset)

I have a spreadsheet/csv:
Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
...
I need to reshape it to the following view:
Code:,Course Description:,Students,Lecture Hours,Labs Hours,Average Test Score,Teaching Staff
101,"Introduction to Rocket Science",John Galt,48,120,4.7,Passed
101,"Introduction to Rocket Science",James Taggart,50,120,4.9,Passed
...
Believe it or not, I cannot get the right idea of how to do that, even though it seems to be a very primitive transformation. Is there any silver bullet for this?
The original records (csv) have a somewhat json-like structure, so my first approach was to represent the original data as a vector and then transpose it, but in this case my resulting table looks like a sparse matrix - the rows I have transposed are blank in the rest of their values.
Another way I'm considering is to serialize it into jsons and then de-serialize into a new spreadsheet (jsonize()) - in this case, I'm having problems with merging them properly.
In both ways I have it "half-working".
Can anyone suggest a simple and reliable algorithm for this?
Any language, RegEx, any tools or code snippets are very much appreciated.
Assuming that the pattern you've described here is consistent throughout, there are quite a few different approaches you could take, but in all cases you can basically use the fact that the 'Course' rows start with "Code:", which is never going to be a student name.
You can take advantage of this either with a regular expression find/replace, or within OpenRefine.
Example:
1. Open the file in a text editor that supports regular expressions in find/replace.
2. Search for lines starting with 'Code:' and add additional commas to the start of the row to shift the course data columns to the right, e.g. search for: ^Code: and replace with: ,,,,,Code:
3. If you now import the file into OpenRefine, you'll have a project with 10 columns (the 10th column is caused by the trailing comma at the end of the course data row).
4. You can now use Transpose (or just rename) on the right-most columns, which contain the course data, while leaving the left-most columns, which contain the student details.
5. Isolate the rows that contain the phrase 'Student Name' in the first column and remove them (via a filter or facet).
6. Move the Course Code/Description columns to the beginning of the project, and use the 'Edit Cells->Fill Down' option on each column to get the values repeated on all the relevant lines.
7. Finally, rename the columns as you want and remove any extraneous columns.
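If a short script is an acceptable tool, the same reshaping can also be done directly. Here is a minimal Python sketch, assuming the input file is called courses.csv, that every row starting with 'Code:' introduces a new course, and that the per-course 'Student Name' header rows should be dropped (the output column names are taken from the input rather than the exact headers shown above):

import csv

rows_out = []
with open('courses.csv', newline='') as f:
    code = description = None
    for row in csv.reader(f):
        if not row:
            continue
        if row[0] == 'Code:':
            # course header row: remember code and description for the rows that follow
            code, description = row[1], row[3]
        elif row[0] == 'Student Name':
            # per-course column header row: skip it
            continue
        else:
            # student row: prepend the current course code and description
            rows_out.append([code, description] + row)

with open('reshaped.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Code:', 'Course Description:', 'Student Name',
                     'Lecture Hours', 'Labs Hours', 'Test Score', 'Status'])
    writer.writerows(rows_out)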

Having a bit of trouble trying to match fields and add fields to the matched ones

So I've been busting my brain for weeks. In MongoDB I have a collection, and in this collection there are multiple JSON documents; all documents have the same fields. What I'm trying to achieve is: match all documents by a specific set of ids, e.g. "id": "1234", i.e. match on a specific field across all the documents and then add another field just for those that match the search. Any ideas?
Although I am not fully clear about your question, from what I understood you can use aggregate() to solve your problem.
Refer to the link below:
https://docs.mongodb.com/v3.2/reference/method/db.collection.aggregate/
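If the new field should actually be stored on the matching documents (rather than just appear in a query result), an update with $set may be the simplest route. A minimal shell sketch, where the collection name, id list and field name are all placeholders:

db.mycollection.updateMany(
  { id: { $in: ["1234", "5678"] } },        // match documents by a specific set of ids
  { $set: { extraField: "some value" } }    // add the new field only to the matched documents
)

If you only need the field in the output of a query, the aggregation equivalent is a $match stage followed by $addFields (MongoDB 3.4+).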

Select multiple URLs from one column in a MySQL table

I have a table with a "content" column that stores forum posts; a record in the "content" field can contain one or more URLs. I want to get all the URLs in the "content" column, one URL per row. I use the code below:
select substr(`content`, locate("http://", `content`))
It works when there is only one URL in a record, giving a list of URLs like
http://www.google.com
http://www.facebook.com
...
but it only gets the first URL when there is more than one URL in the record.
How can I fix this?
Another way to look at it is to try:
SELECT GROUP_CONCAT(substr(`content`, locate("http://", `content`))) FROM your_table;
which would concatenate all the URLs into a single string and let you carry on from there - maybe you can split it in your application code rather than require the DB to do it. Otherwise you can use the trick of an auxiliary table of integers 1-n: SQL split comma separated row
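For reference, here is a rough sketch of that numbers-table trick applied to this case. It assumes a helper table numbers holding integers 1..N (N at least the maximum number of URLs per post), a table posts with the content column, and that each URL is terminated by a space; real forum text may need more terminating characters:

-- one output row per occurrence of 'http://'; the join condition counts
-- how many times 'http://' appears in each post
SELECT CONCAT('http://',
       SUBSTRING_INDEX(                       -- cut the URL off at the first space
         SUBSTRING_INDEX(                     -- keep the text right after the n-th 'http://'
           SUBSTRING_INDEX(p.content, 'http://', n.n + 1),
           'http://', -1),
         ' ', 1)) AS url
FROM posts p
JOIN numbers n
  ON n.n <= (LENGTH(p.content) - LENGTH(REPLACE(p.content, 'http://', ''))) / LENGTH('http://');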

Find column values that are a starting substring (prefix) of a given string

I have a database table that contains URLs in a column. I want to show certain data depending on what page the user is on, defaulting to a 'parent' page if there is no direct match. How can I find the rows where the column value is a prefix of the submitted URL?
Eg. I have www.example.com/foo/bar/baz/here.html; I would expect to see (after sorting on length of column value):
www.example.com/foo/bar/baz/here.html
www.example.com/foo/bar/baz
www.example.com/foo/bar
www.example.com/foo
www.example.com
if all those URLs are in the table of course.
Is there a built in function or would I need to create a procedure? Googling kept getting me to LIKE and REGEXP, which is not what I need. I figured that a single query would be much more efficient than chopping the URL and making multiple queries (the URLs could potentially contain many path components).
Simply turn the LIKE operator around:
SELECT * FROM urls WHERE "www.example.com/foo/bar/baz/here.html" LIKE CONCAT(url, "%");
http://sqlfiddle.com/#!2/ef6ee/1
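Since the question mentions sorting on the length of the column value, you can add an ORDER BY to get the most specific (longest) match first, or keep only the best match with LIMIT; the same hypothetical urls table and url column are assumed:

SELECT * FROM urls
WHERE "www.example.com/foo/bar/baz/here.html" LIKE CONCAT(url, "%")
ORDER BY CHAR_LENGTH(url) DESC
LIMIT 1;  -- drop the LIMIT 1 to list every matching parent page, most specific first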