Tabulated review threads sorted by average score

Discussion in 'Trek Literature' started by Sho, Jan 13, 2012.

  1. JeBuS

    JeBuS Lieutenant Commander Red Shirt

    Joined:
    Sep 4, 2013
    Umm, the reason TF:R&D is ranked #1 is because it's picking up the wrong edition. This versus this. Huge difference. It raises the question of how correct the rest of them are.
     
  2. Defcon

    Defcon Rear Admiral Rear Admiral

    Joined:
    May 9, 2003
    Location:
    Germany
    The problem lies with Goodreads, since they have multiple listings for the book. On Shelfari you can request a merger of duplicate books, but I haven't found this option on Goodreads yet.
     
  3. JeBuS

    JeBuS Lieutenant Commander Red Shirt

    Joined:
    Sep 4, 2013
    As a Goodreads librarian, I thought I'd do it myself, but it's only possible when an item is on fewer than 5 people's shelves. That one had 7. I've put in a merge request so a super-librarian can do it. Hopefully it'll be sorted soon.

    Sho, why don't you use the ISBN of the books for the Goodreads API?
     
  4. Sho

    Sho Fleet Captain Fleet Captain

    Joined:
    Sep 8, 2006
    Location:
    Berlin, Germany
    Because the threads here don't contain it. Remember that the whole thing runs fully autonomously and automated off the forum (and now Goodreads); I don't want to hunt for threads manually or collect metadata about them manually.

    I could use some sort of other source to map from title + author to ISBN -- but then Goodreads effectively does that too.

    Pretty sure this is an outlier, anyhow.
     
    Last edited: Nov 17, 2013
  5. JeBuS

    JeBuS Lieutenant Commander Red Shirt

    Joined:
    Sep 4, 2013
    Merger complete on TF:R&D

    But it's likely to happen every time a new book is announced. These outliers are the result of regular users adding the books as they can, without complete info. If a few people use it instead of the real one, it gets stuck, as happened with TF:R&D. Yeah, they probably eventually get merged, but it's not uncommon for them to be there.
     
  6. Sho

    Sho Fleet Captain Fleet Captain

    Joined:
    Sep 8, 2006
    Location:
    Berlin, Germany
    Thanks, I kicked off an out-of-schedule run and the new data is in now.


    No doubt. That's why, if you go back two pages to when this was first proposed, I wrote "Most likely the main challenge is reliably mapping a review thread to the goodreads entry in an automated fashion", anticipating such problems.

    I see no trivial solution. You seem to know Goodreads better though. If you have suggestions for algorithmically selecting the best search result better than Goodreads' result weighting works by itself (example: "use the entry with the highest number of ratings that still has Star Trek in the title") I'm interested.
     
  7. JeBuS

    JeBuS Lieutenant Commander Red Shirt

    Joined:
    Sep 4, 2013
    That would have been my suggestion, actually. Though who knows how similar some of the Trek novels' titles are to one another? It may just end up kicking the can down the road to a different problem.
     
  8. Sho

    Sho Fleet Captain Fleet Captain

    Joined:
    Sep 8, 2006
    Location:
    Berlin, Germany
    I think we'll just have to keep an eye out for how it behaves in practice.

    For the record, currently, the search process works like this:

    1. Take the thread title.
    2. Chop off the first segment ending in a colon (i.e. the series label) if there is one.
    3. Strip out the trigger phrase the thread discovery keys on ("Review Thread").
    4. Strip out anything that looks like author name initials. The Goodreads search is super strict: searching for an author initial can already mean no results, since there is no basic string match against the fully written-out name. (If I were implementing Goodreads' search engine, I would account for this.)
    5. Strip out anything that looks like "(spoiler*)".
    6. Strip out the last occurrence of the word "by".
    7. Strip out special chars like "&".
    8. Simplify and trim whitespace.
    9. Prepend "Star Trek ".
    10. Search Goodreads for it.

    So for example, "TF: A Ceremony of Losses by David Mack Review Thread (Spoilers!)" results in a search for "Star Trek A Ceremony of Losses David Mack".
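
The steps above can be sketched roughly as follows. This is only an illustration in Python, not the script's actual code: the function name `build_query` and every regular expression are assumptions, chosen so that the example title produces the query quoted above.

```python
import re


def build_query(thread_title: str) -> str:
    """Turn a review thread title into a search query (hypothetical sketch)."""
    q = thread_title
    # Step 2: chop off the first segment ending in a colon (the series label).
    q = re.sub(r"^[^:]*:\s*", "", q, count=1)
    # Step 3: strip the trigger phrase used for thread discovery.
    q = re.sub(r"review thread", " ", q, flags=re.IGNORECASE)
    # Step 4: strip things that look like initials, e.g. "K.W." or "R.A".
    q = re.sub(r"\b(?:[A-Z]\.)+(?:[A-Z]\b)?", " ", q)
    # Step 5: strip anything that looks like "(spoiler*)".
    q = re.sub(r"\(\s*spoiler[^)]*\)", " ", q, flags=re.IGNORECASE)
    # Step 6: strip the last occurrence of the word "by".
    by_matches = list(re.finditer(r"\bby\b", q, flags=re.IGNORECASE))
    if by_matches:
        last = by_matches[-1]
        q = q[:last.start()] + " " + q[last.end():]
    # Step 7: strip special characters such as "&" (apostrophes are kept).
    q = re.sub(r"[^\w\s']", " ", q)
    # Step 8: simplify and trim whitespace.
    q = " ".join(q.split())
    # Step 9: prepend the franchise name.
    return "Star Trek " + q


print(build_query("TF: A Ceremony of Losses by David Mack Review Thread (Spoilers!)"))
```

Step 10, the actual Goodreads search, is left out on purpose, since the request format is not described in this thread.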

    However, in the case of Revelation and Dust, the author name is included in the thread title as "DRGIII" (for space reasons). If you include that in the search, Goodreads' strictness results in zero search results. So I introduced a simple list of words to remove from titles and seeded it with DRGIII and KRAD for now.

    Now, before the two R&D entries were merged, the result for its search query ("Star Trek Revelation and Dust") contained both entries, but in the wrong order, i.e. with the bad one first. If I had written additional code to grab the entry with the largest number of ratings, that would have corrected for it.
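
A minimal sketch of that "grab the entry with the largest number of ratings" idea, in Python. The `SearchResult` type and its fields are hypothetical stand-ins for whatever the search actually returns, and the rating counts in the usage lines are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SearchResult:
    # Hypothetical shape of one search hit; not the real API response.
    title: str
    ratings_count: int


def pick_most_rated(results: Sequence[SearchResult]) -> Optional[SearchResult]:
    """Return the hit with the most ratings, or None if there are no hits.

    On a tie, max() keeps the earlier hit, so the search engine's own
    ordering still acts as the tie-breaker.
    """
    if not results:
        return None
    return max(results, key=lambda r: r.ratings_count)


# Made-up numbers: the stray duplicate is listed first but has few ratings.
hits = [
    SearchResult("Star Trek Revelation and Dust", 3),
    SearchResult("Revelation and Dust", 120),
]
print(pick_most_rated(hits).title)
```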

    However, I did a quick test, and just replacing the kill words list with a substitution table (turning "DRGIII" into "David George") would have done the trick too: The same search with the author appended ("Star Trek Revelation and Dust David George") put the desired entry first.
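
A substitution table like this could look roughly as follows in Python. The names `SUBSTITUTIONS` and `apply_substitutions` are assumptions. The "DRGIII" mapping is the one quoted above; "KRAD" is mapped to an empty string, i.e. kept as a plain kill word, because the thread does not say what it should be replaced with.

```python
# Hypothetical table: abbreviation in the thread title -> search-friendly text.
# An empty string means "just remove the word" (the old kill-list behaviour).
SUBSTITUTIONS = {
    "DRGIII": "David George",
    "KRAD": "",
}


def apply_substitutions(query: str, table=SUBSTITUTIONS) -> str:
    """Replace known abbreviations word by word and drop emptied words."""
    replaced = (table.get(word, word) for word in query.split())
    return " ".join(word for word in replaced if word)


print(apply_substitutions("Star Trek Revelation and Dust DRGIII"))
```

A kill list is then just the special case where every entry maps to the empty string, so the two mechanisms do not need separate code paths.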

    The difference is most likely because the bad entry's title had a higher similarity to our search terms - it contained "Star Trek" in the title, while for the right entry, the association to Star Trek was moved out into metadata. Supplying the author as well somehow convinced the scoring algorithm in Goodreads' search engine that the other entry was the better match.

    This is why I made the outlier claim early: Revelation and Dust is the only book we aren't searching for with author name included, and if you do include the author name, Goodreads' own result weighting seems to do an acceptable job.

    I'll implement the substitution thing later.

    Edit: Another idea would be to append instead of prepend "Star Trek" to the search terms. Since the right entries always have the Trek association in metadata, only bad entries would have "Star Trek" at the beginning of the title, and prepending then biases the scoring badly because of the high string similarity. Goodreads is most likely using a trivial Levenshtein distance algo for that.
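
To illustrate the prepend-versus-append argument, here is a small Python sketch that ranks two titles by plain Levenshtein distance. It assumes, as guessed above, that the scoring behaves like an edit distance, which is not confirmed anywhere. The two titles are assumed spellings of the duplicate and the proper entry.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


# Assumed titles: the duplicate carries the franchise name, the proper one does not.
BAD_TITLE = "Star Trek Revelation and Dust"
GOOD_TITLE = "Revelation and Dust"


def best_match(query: str, titles=(BAD_TITLE, GOOD_TITLE)) -> str:
    """Return the title closest to the query by edit distance."""
    return min(titles, key=lambda title: levenshtein(query, title))


print(best_match("Star Trek Revelation and Dust"))  # prepended query
print(best_match("Revelation and Dust Star Trek"))  # appended query
```

With the prepended query, the duplicate is an exact match (distance 0), while the proper entry is 10 deletions away. With the appended query, the proper entry is still 10 away, but the duplicate no longer lines up and ends up further off.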
     
  9. JeBuS

    JeBuS Lieutenant Commander Red Shirt

    Joined:
    Sep 4, 2013
    ^That is an interesting read. That's all I can say.
     
  10. Sho

    Sho Fleet Captain Fleet Captain

    Joined:
    Sep 8, 2006
    Location:
    Berlin, Germany
    Thanks ... :).

    If you're curious, this is how this looks in code form: http://www.eikehein.com/repositorie...13ca9742721af10fff7d4e3258d8edf785caf;hb=HEAD

    It's all a bit quick and dirty, but has been working pretty reliably so far. Lines 113-149 interact with Goodreads.

    I try to habitually future-proof this sort of stuff. If I run out of time or interest to look after the forum (the latter is unlikely, but who knows about the former?) I want things to just continue working unattended. And for the get-hit-by-a-bus scenario the source code is available so someone else can set it up.

    Another idea to make the search better btw. is to just run the search terms through a standard spell checker and allow it to do high-confidence substitutions. That would have taken care of the What Judgements Come problem.
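
A rough sketch of the high-confidence substitution idea, using Python's standard `difflib` in place of a real spell checker. The function name, the vocabulary and the 0.9 cutoff are all assumptions; a real setup would use a proper dictionary.

```python
import difflib


def correct_terms(query: str, vocabulary, cutoff: float = 0.9) -> str:
    """Replace a word only when a known word is a very close match.

    `vocabulary` is a hypothetical list of correctly spelled words; the high
    cutoff keeps the substitutions conservative.
    """
    by_lower = {word.lower(): word for word in vocabulary}
    corrected = []
    for word in query.split():
        key = word.lower()
        if key in by_lower:
            corrected.append(word)  # already a known word, leave it alone
            continue
        close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
        corrected.append(by_lower[close[0]] if close else word)
    return " ".join(corrected)


vocabulary = ["What", "Judgments", "Come"]
print(correct_terms("What Judgements Come", vocabulary))
```

Words with no close match in the vocabulary pass through unchanged, so an unusual but correct title word is not rewritten.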
     
  11. JeBuS

    JeBuS Lieutenant Commander Red Shirt

    Joined:
    Sep 4, 2013
    If that's quick and dirty, I'm afraid of what slow and clean looks like. It seems perfectly readable to me as-is.
     
  12. Sho

    Sho Fleet Captain Fleet Captain

    Joined:
    Sep 8, 2006
    Location:
    Berlin, Germany
    ^ Thanks. :)
     
  13. Defcon

    Defcon Rear Admiral Rear Admiral

    Joined:
    May 9, 2003
    Location:
    Germany
    And now Ex Machina has joined the group. :guffaw:
     
  14. Sho

    Sho Fleet Captain Fleet Captain

    Joined:
    Sep 8, 2006
    Location:
    Berlin, Germany
    I've now added the search improvements I mentioned (except for the spelling correction for now), and they don't seem to have impacted things negatively at least.

    Edit: I'm now also reporting Goodreads scores as "n/a" when there are fewer than 4 votes, the same bar used against thread votes for inclusion in the table.
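
The vote threshold amounts to something like the following Python sketch. The names `MIN_VOTES` and `format_score` and the two-decimal formatting are assumptions; only the "n/a below 4 votes" rule comes from the post.

```python
MIN_VOTES = 4  # same bar as for thread votes, per the post above


def format_score(average: float, votes: int) -> str:
    """Show the average only when enough votes back it up."""
    if votes < MIN_VOTES:
        return "n/a"
    return f"{average:.2f}"  # two decimals is an arbitrary display choice


print(format_score(4.5, 3))
print(format_score(4.5, 4))
```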
     
    Last edited: Nov 17, 2013
  15. Avro Arrow

    Avro Arrow Vice Admiral Moderator

    Joined:
    Jan 10, 2003
    Location:
    Canada
    :lol: I didn't realize the "hit by a bus" thing was some sort of industry standard, that apparently even crosses national boundaries. Why do we never talk about other forms of death when discussing future supportability? Apparently coders spend a lot of their time running out into traffic suddenly... :lol:
     
  16. Sho

    Sho Fleet Captain Fleet Captain

    Joined:
    Sep 8, 2006
    Location:
    Berlin, Germany
    We also talk about the "bus number" of a codebase :).

    The "bus number" is the number of developers who'd need to get hit by a bus to seriously disrupt further development of the codebase because of the knowledge exclusive to their heads.

    I.e. if a codebase has a bus number of 1, only one developer needs to get up close and personal with the front of a bus traveling at high velocity, and the project no longer functions. Increasing the bus number isn't as easy as just adding more people, either, it's more about documenting things and other forms of spreading knowledge, making code accessible enough so someone can jump in and take over, and removing other barriers to someone else stepping up and accepting responsibility.
     
  17. trampledamage

    trampledamage Clone Admiral

    Joined:
    Sep 11, 2005
    Location:
    hitching a ride to Erebor
    :lol: I haven't come across bus number before, but we did use to have people designated as "bus people", in that they absolutely could not get hit or the project would collapse!

    Nice work, Sho!
     
  18. Sho

    Sho Fleet Captain Fleet Captain

    Joined:
    Sep 8, 2006
    Location:
    Berlin, Germany
    ^ Thanks!
     
  19. Defcon

    Defcon Rear Admiral Rear Admiral

    Joined:
    May 9, 2003
    Location:
    Germany
    So I have thought about a rough schedule for the next few "classic" review threads, mixing "newer" novels with older ones while rotating through the series (I'll go with the one-book-every-two-weeks schedule, by the way):

    TNG: Death in Winter * Michael Jan Friedman
    DS9: Warped * K.W. Jeter
    VOY: String Theory #1: Cohesion * Jeffrey Lang
    Enterprise: By the Book * Dean Wesley Smith & K.K. Rusch
    IKS Gorkon: A Good Day to Die * Keith R.A. DeCandido
    Titan: Taking Wing * Andy Mangels & Michael A. Martin

    Any thoughts?
     
  20. Markonian

    Markonian Fleet Admiral Moderator

    Joined:
    Jun 2, 2012
    Location:
    Derbyshire, UK
    Sounds good to me. I'm going to read A Good Day to Die soon for the first time, and consider re-reading Taking Wing, too. It's great to share thoughts about books so old.

    Do you have a specific pattern for which books/series you choose for the classics threads?