
eep case study: Author name resolution in The 49th Shelf

By: Doug Plant | July 31, 2012 | Case study and eZ Publish development tips

The problem:

One of our favourite projects, The 49th Shelf, aggregates a lot of data from a variety of sources. Naturally, the quality varies, but the bigger problem is that different sources often refer to the same real-world entity in different ways. Specifically, a person who writes several books can appear under several names:

  • John Reynolds
  • John L. Reynolds
  • John Lawrence Reynolds

There are some heuristics that resolve most of these cases, but there always seem to be some that defy the code and require a manual fix.

This process has to be performed on the production site, so we want to minimize the amount of time that the data is inconsistent and we don't want to run untested code.

The solution:

The fix comes in two parts:

  • A facility within The 49th Shelf to provide canonical names for all the versions of the name, and
  • eep, the eZ Publish command line tool, which massively helps in all phases of the resolution.

So, today, I have to resolve Mr Reynolds. The first thing is to locate the contributor objects:

> eep list children 2  | grep Contrib
 |      95 |      94 |             folder |      /1/2/94/ |   26 |    1 | 0/0 |  0 |         Contributors | 

>  eep list children 94 | grep R
 | Object | Node |              Class |         Path | Ch'n | Cont | H/I | RR | Name | 
 |    332 |  328 | contributor_folder | /1/2/94/328/ | 1267 |    1 | 0/0 |  0 |    R | 

>  eep list children 328 --limit=2000 | grep Reynolds | grep John
 | 1174990 | 1174899 | onix_contributor | /1/2/94/328/1174899/ |    0 |    1 | 0/0 | 12 | Reynolds, John La... | 
 | 2021531 | 2021646 | onix_contributor | /1/2/94/328/2021646/ |    0 |    1 | 0/0 |  3 |       Reynolds, John | 
 | 2021847 | 2021962 | onix_contributor | /1/2/94/328/2021962/ |    0 |    1 | 0/0 |  1 |    Reynolds, John L. | 

I need the books that are reverse-related to these contributors.

> eep list children 328 --limit=2000 | grep Reynolds | grep John | awk '{print $2}' > jlr_contributor_oids.txt
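
The `awk '{print $2}'` step works because awk splits each line on runs of whitespace, so the leading `|` of a table row is field 1 and the object ID is field 2. A minimal sketch on two rows copied from the listing above (the function name `extract_oids` and the variable `sample_rows` are only for illustration and are not part of eep):

```shell
# Extract the Object ID column from eep's table output.
# awk splits on whitespace by default, so for a row such as
#   " | 2021531 | 2021646 | onix_contributor | ..."
# field 1 is the leading "|" and field 2 is the Object ID.
extract_oids() {
    awk '{print $2}'
}

# Two rows copied from the listing above, used as sample input.
sample_rows=' | 2021531 | 2021646 | onix_contributor | /1/2/94/328/2021646/ |    0 |    1 | 0/0 |  3 |       Reynolds, John | 
 | 2021847 | 2021962 | onix_contributor | /1/2/94/328/2021962/ |    0 |    1 | 0/0 |  1 |    Reynolds, John L. | '

# Prints one Object ID per line.
printf '%s\n' "$sample_rows" | extract_oids
```

Running the filter on a couple of pasted rows like this is a cheap way to check it before pointing it at the production listing.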

> cat jlr_contributor_oids.txt | xargs -IOID eep co reverserelated OID
 +----------+------------+-----------------+-----+---------------------------+
 | Reverse related objects [1174990]                                         | 
 +----------+------------+-----------------+-----+---------------------------+
 | ObjectID | MainNodeID | ClassIdentifier | SID |                      Name | 
 +----------+------------+-----------------+-----+---------------------------+
 |   304206 |     304162 |    onix_product |   1 |               Holy Terror | 
 |  2956950 |    2959420 |    onix_product |   1 | When All You Have Is Hope | 
 |  2957002 |    2959472 |    onix_product |   1 |    The Skeptical Investor | 
 |  2957568 |    2960043 |    onix_product |   1 |    The Skeptical Investor | 
 |  2958563 |    2961048 |    onix_product |   1 |        The Naked Investor | 
 |  2262509 |    2263184 |    onix_product |   1 | When All You Have Is Hope | 
 |  2959392 |    2961880 |    onix_product |   1 |                 Prognosis | 
 |  2959696 |    2962187 |    onix_product |   1 |        The Naked Investor | 
 |   159104 |     159062 |    onix_product |   1 |             Shadow People | 
 |   158823 |     158781 |    onix_product |   1 |             Shadow People | 
 |  3137774 |    3140931 |    onix_product |   1 |               Beach Strip | 
 |  3134888 |    3138026 |    onix_product |   1 |               Beach Strip | 
 +----------+------------+-----------------+-----+---------------------------+

 +----------+------------+-----------------+-----+-----------------------------+
 | Reverse related objects [2021531]                                           | 
 +----------+------------+-----------------+-----+-----------------------------+
 | ObjectID | MainNodeID | ClassIdentifier | SID |                        Name | 
 +----------+------------+-----------------+-----+-----------------------------+
 |  1303215 |    1303121 |    onix_product |   1 | Bubbles, Bankers & Bailouts | 
 |  1302854 |    1302760 |    onix_product |   1 |          One Hell of a Ride | 
 |  2147681 |    2148108 |    onix_product |   1 |          One Hell of a Ride | 
 +----------+------------+-----------------+-----+-----------------------------+

 +----------+------------+-----------------+-----+---------------+
 | Reverse related objects [2021847]                             | 
 +----------+------------+-----------------+-----+---------------+
 | ObjectID | MainNodeID | ClassIdentifier | SID |          Name | 
 +----------+------------+-----------------+-----+---------------+
 |   158823 |     158781 |    onix_product |   1 | Shadow People | 
 +----------+------------+-----------------+-----+---------------+

Then I save the object IDs of the related books:

> cat jlr_contributor_oids.txt | xargs -IOID eep co reverserelated OID | awk '/  1 / {print $2}' > jlr_book_oids.txt
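
The pattern `/  1 /` selects only the data rows, since those are the only lines where the SID column reads `1` with padding on both sides; the borders and header rows do not match. Note that object 158823 is reverse-related to two of the contributors, so it lands in the file twice. Reimporting a book twice is harmless here, but if you prefer a clean list, a sketch with `sort -u` added (the name `book_oids` and the sample variable are illustrative only):

```shell
# Keep only data rows (SID column padded as "  1 "), print the ObjectID,
# and drop IDs that appear under more than one contributor.
book_oids() {
    awk '/  1 / {print $2}' | sort -u
}

# A header, a border and three data rows modelled on the tables above;
# 158823 appears twice on purpose.
sample_tables=' | ObjectID | MainNodeID | ClassIdentifier | SID |          Name | 
 |   159104 |     159062 |    onix_product |   1 | Shadow People | 
 |   158823 |     158781 |    onix_product |   1 | Shadow People | 
 +----------+------------+-----------------+-----+---------------+
 |   158823 |     158781 |    onix_product |   1 | Shadow People | '

# Prints each distinct ObjectID once.
printf '%s\n' "$sample_tables" | book_oids
```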

This is all going to work because: (a) we can delete and reimport contributors without difficulty, and (b) we can reimport books repeatedly without difficulty.

Also, I've written an extension to eep that wraps up a bunch of tasks that are specific to The 49th Shelf. The task that is useful here simply adds a rule to the name authority database within The 49th Shelf. However, it's worth noting that writing extensions to eep is trivial: it provides a quick way to write PHP that runs in the context of an eZ kernel, with a couple of advantages over the conventional methods. The code isn't tied to a particular kernel instance, and you can run it anywhere in the filesystem. But more on that another time. This is how eep adds a new rule:

> eep books
...
addcontriblookup
  eep books addcontriblookup <orig first name> <orig last name> <converted first name> <converted last name> <converted title>
  - note that 'title' is typically: <last name>, <first name>

So I'm going to add two lookups that map the two less common versions of the name onto the more common version:

> eep books addcontriblookup John Reynolds "John Lawrence" Reynolds "Reynolds, John Lawrence"
> eep books addcontriblookup "John L." Reynolds "John Lawrence" Reynolds "Reynolds, John Lawrence"
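
Conceptually, each rule maps an original (first name, last name) pair to a converted name and title. The real name authority database lives inside The 49th Shelf and its storage is not shown here; the following is only a toy model of the idea, with a hypothetical tab-separated table and a hypothetical `canonical_title` function:

```shell
# Hypothetical lookup table, one rule per line, tab-separated:
# orig first, orig last, converted first, converted last, converted title.
lookup_table=$(printf 'John\tReynolds\tJohn Lawrence\tReynolds\tReynolds, John Lawrence\nJohn L.\tReynolds\tJohn Lawrence\tReynolds\tReynolds, John Lawrence\n')

# Print the converted title for a (first, last) pair. When no rule
# matches, fall back to the typical "<last name>, <first name>" form.
canonical_title() {
    printf '%s\n' "$lookup_table" |
        awk -F'\t' -v first="$1" -v last="$2" '
            $1 == first && $2 == last { print $5; found = 1; exit }
            END { if (!found) print last ", " first }
        '
}

canonical_title "John L." Reynolds   # matches the second rule
canonical_title Jane Doe             # no rule, falls back to "Doe, Jane"
```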

Now, reset the data.

I'll need the ISBNs of the affected books to reimport them, so I'll cache these now:

> cat jlr_book_oids.txt | xargs -IOID eep attribute tostring OID isbn > jlr_isbns.txt

Delete all 3 contributors - although 2 would also be ok:

> cat jlr_contributor_oids.txt | xargs -IOID eep contentobject delete OID
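
Since this runs against production, it can be worth previewing a destructive pipeline first. One common way, sketched here, is to put `echo` in front of the command so that xargs prints each command line instead of executing it (the function name `preview_delete` is just for illustration):

```shell
# Dry run: with echo in front, xargs prints the command it would run
# for each input line instead of running it.
preview_delete() {
    xargs -IOID echo eep contentobject delete OID
}

# The three contributor object IDs from the listing above.
printf '%s\n' 1174990 2021531 2021847 | preview_delete
```

Once the printed commands look right, removing the `echo` gives the real command.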

Download new data for these books (and save the cache ids so that I can import them):

> cat jlr_isbns.txt | xargs -IISBN eep books downloadisbn ISBN | awk -F\' '/Saved buffer:/ {print $2}' > jlr_importids.txt
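
The awk filter here uses a single quote as the field separator (`-F\'`), so on any line containing `Saved buffer:` the second field is whatever sits between the first pair of single quotes. A sketch on made-up input; the sample lines below are hypothetical and only assume that the cache id appears in single quotes after `Saved buffer:`, as the filter implies:

```shell
# Print the text between the first pair of single quotes on every
# line that contains "Saved buffer:"; all other lines are ignored.
extract_import_ids() {
    awk -F\' '/Saved buffer:/ {print $2}'
}

# Hypothetical output lines; the real wording of
# `eep books downloadisbn` may differ.
printf '%s\n' "Downloading record" "Saved buffer: 'example-id-1'" | extract_import_ids
```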

Reimport the titles, which updates the book data and recreates the contributors, this time with the name-authority rules in operation:

> cat jlr_importids.txt | xargs -IIMPORTID php runcronjobs.php oniximport --specificrecord=IMPORTID

And clean up:

> rm jlr_*

And that's it.
