Wikipedia Paragraphs to Google Sheet Columns

I am in the process of evaluating some Robotic Simulation software.  The first part of the process is to create a list of software to evaluate.  Wikipedia offers a good starting point with a list of package names, urls, and descriptions.  But what I want to do is to put this in a Google Sheet and add additional columns to it.

WikipediaListToGoogleSheet

 

Method 1: Copy & Paste

The first method that you might consider is copying and pasting the various components into a Google sheet.  There are 34 items in the list.  Each item contains 3 parts: package name, URL, and description.  Assuming you could copy and paste each component in less than 5 seconds, that would mean about 15 seconds per item.  all told 15 * 34 = 610 seconds.  This means about 10 minutes of work.

 

Method 2: Regular Expressions

Regular expressions allow for you to match pattern of strings and then to manipulate those patterns.  This allows for you to take a task above that would take 10 minutes of steady, solid, error prone work.  But there is a method that could take that same work and turn it into less than a minute of effort.  After which you will feel smart and could use the other 9 minutes to write a blog post like this.

 

Step 1:  Click on the Edit tab in Wikipedia:  That will let you see the Wiki Markup that looks like this:

* [http://www.parallemic.org/RoKiSim.html RoKiSim]: A free educational software for the 3D simulation of six-axis PUMA-type serial robots.
* [http://www.RoboAnalyzer.com RoboAnalyzer]: A free educational software to teach Robotics Concepts and Virtual Simulation (Joint Jogging and Cartesian Motion) using CAD models of Industrial Robots.

 

Step 2:  Copy and paste the wiki markup that represents the list of items and put it in NotePad++.

 

Step 3:  Identify what you are looking for and how you want to transform it.

You can see that each list item starts with a *.  The link and package name are in “[ ]”  And there is a semicolon that you want to get rid of.

 

 

Find this:

 * [http://www.parallemic.org/RoKiSim.html RoKiSim]: A free educational software for the 3D simulation of six-axis PUMA-type serial robots.

And replace it with (the extra spaces represent tabs… Tabs allow for us to put the information into different columns in Google Sheets):

 http://www.parallemic.org/RoKiSim.html    RoKiSim    A free educational software for the 3D simulation of six-axis PUMA-type serial robots.

 

Step 4: Determine the Regular Expression that you will need to search for and the replace

 

Find: \*\[(.*) (.*)\]

Replace: \1\t\2\t

 

Here is what your NotePadd++ replace dialog box should look like.  Click “Replace All”.

Screenshot 2014-11-24 10.31.30

 

Step 5:  Copy and Paste the results into Google Sheets and you will have columns containing the URL, Name, and description.  You can then move around the columns into the order you want.

 

Figuring out the regular expression is going to be the most difficult part of this process.  It is a skill that is not widely mastered, and looks mysterious.  But it would be better to spending more time to master this skill and eventually be able to quickly pull this together quickly, than to waste that time doing copy and paste.