On Twitter Jolene asked about parsing randomly formatted trinomials using Open Refine. Here is a recipe that should work in many cases (can’t guarantee all).
So, start with having a project open in Open Refine and the non-standard trinomials are given in a column called “Trinomial”. The following code will help parse out state, county, and site ID parts of the Trinomial in three new fields:
(1) Getting the Site Number from a Trinomial
Select the Trinomial field and “Add a new column based on this column”. The Name of this field will be “Parsed Site”. In the Grel expression window put in this:
toNumber(replace(trim(cells[‘Trinomial’].value), /[^0-9]+/, ‘/’).split(‘/’)[1])
To explain, a regular expression strips all non-number characters and converts this to 1 slash character (‘/’). So ’44SX202′ becomes ’44/202′. ’44-SX 202′ also becomes ’44/202′, and ’44-SX-00202′ becomes ’44/00202′. The “split” function then slits the ’44/202′ into a list of 2 items [’44’, ‘202’] or [’44’, ‘00202’]. We select the second part of that list to be the site ID (that’s the [1] part of the above expression, since in most programming languages the first item in a list is the 0 index, or [0], and so second is [1]). Finally, to remove any leading zeros (as with the case of ‘00202’), we convert this value from a string of characters to a number with the “toNumber” part of the expression.
(2) Getting the County Code from a Trinomial:
Select the Trinomial field and “Add a new column based on this column”. The Name of this field will be “Parsed County”. It’s a bit easier to get the county from a Trinomial, since a county code should be the only alphabetic characters in the trinomial.
toUppercase(cells[‘Trinomial’].value.replace(/[^A-z+]+/, ”))
To explain, the “replace” function uses a regular expression to delete all alphabet characters. Then the “toUppercase” function converts the result to uppercase characters.
(3) Getting the State Number from a Trinomial
Select the Trinomial field and “Add a new column based on this column”. The Name of this field will be “Parsed State”. In the Grel expression window put in this:
toNumber(replace(trim(cells[‘Trinomial’].value), /[^0-9]+/, ‘/’).split(‘/’)[0])
To explain, this is just like getting the Site Number from a trinomial, except we’re getting the first batch of numbers, not the second. That’s why we have the “[0]” in this expression and not “[1]”.
-
This topic was modified 6 years, 2 months ago by
Eric Kansa.