Feature #3002
open
Overhaul CAC search by using external search appliance
0%
Description
We currently use a Google Custom Search Engine for our primary search. This is bad in a number of ways:
1. It's Google
2. It's not faceted in ways that make sense for the Commons (i.e., you can't distinguish between relevant content types, like Forum Posts vs Blog Posts)
3. It doesn't index private/hidden content
By using an external search appliance, like Sphinx or Elasticsearch, we can greatly enhance our search experience. In addition to improved discoverability of our content, we could leverage a search appliance in the future for topic modeling, data mining, recommendation engines, etc.
Requirements:
a. Should have a robust API, ideally with a PHP or even a WordPress integration already in the wild for us to start with
b. The search appliance itself should be fairly easy to get up and running, because it's likely that we'll have to do it ourselves given our lack of André
c. We should be able to index content in such a way that it respects differences in content type (groups, group forum posts, blog comments, wiki pages, etc etc etc)
d. We should be able to filter results based on privacy/visibility settings (this could happen either in the index or on the WordPress end)
I'm leaning toward Elasticsearch, as I think it meets all these criteria, and it just generally seems neat.
My plan of attack is to set up a test instance somewhere, and then take the existing WP integration plugin http://wordpress.org/plugins/wp-elasticsearch/ and start to build it out to handle BuddyPress and WP Multisite more elegantly. I'll spend a few hours exploring to get a sense of just how big a job it'll be, and then we can think about milestones.
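To make requirements (c) and (d) concrete, here is a minimal sketch of what a query against the index might look like, written in Python only so it can be run standalone. It builds an Elasticsearch query body that filters by visibility and optionally by content type, and asks for a per-type facet count. The field names (`content_type`, `visibility`, `title`, `content`) and the type value `forum_post` are assumptions for illustration; the real index mapping has not been decided.

```python
from typing import Optional, Sequence

# Hypothetical field names: the actual index mapping is not specified in this ticket.
CONTENT_TYPE_FIELD = "content_type"
VISIBILITY_FIELD = "visibility"


def build_search_body(
    terms: str,
    content_type: Optional[str] = None,
    allowed_visibility: Sequence[str] = ("public",),
) -> dict:
    """Build an Elasticsearch query body with visibility and content-type filters."""
    # Requirement (d): only return documents the current user is allowed to see.
    filters = [{"terms": {VISIBILITY_FIELD: list(allowed_visibility)}}]
    # Requirement (c): optionally narrow to one content type (e.g. forum posts).
    if content_type is not None:
        filters.append({"term": {CONTENT_TYPE_FIELD: content_type}})
    return {
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": terms, "fields": ["title", "content"]}}
                ],
                "filter": filters,
            }
        },
        # Facet counts per content type, for a "Forum Posts vs Blog Posts" sidebar.
        "aggs": {"by_type": {"terms": {"field": CONTENT_TYPE_FIELD}}},
    }


body = build_search_body("spanish", content_type="forum_post")
```

This version does the privacy filtering in the index. The alternative mentioned in (d) would be to drop the visibility filter here and check each hit on the WordPress end before displaying it.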
Related issues
Updated by Matt Gold almost 11 years ago
This sounds very cool. Looking forward to learning more.
Updated by Boone Gorges about 5 years ago
- Has duplicate Feature #11826: Improving Search added
Updated by Boone Gorges about 5 years ago
- Target version changed from Future release to 1.17.0
From Luke in #11826:
In the past we've explored using a utility like Solr or Elastic for search on the Commons. I'm wondering if we might revisit this conversation this fall as part of our efforts to enhance discoverability across the system and/or search within individual, high traffic sites.
FWIW, GC IT is currently running a Solr server that we might test against, and JITP has gotten a couple of questions about more precise search results of its archives.
I'm working on a similar project for the City Tech OpenLab. Let's take time to assess that project sometime this fall, and consider how much of it is applicable for the Commons. (There are many complications related to privacy, result ordering and filtering, etc.)
Updated by Laurie Hurson about 5 years ago
Hi All,
Just a data point I discovered while searching for a site. The site is public so it should presumably come up in a search on the Commons.
The site title is "The Global Spanish-Speaking Community (SPAN 2204)". Searches for "Spanish", "global Spanish", and "global spanish-speaking" return few results, none of which is this site.
When I search in groups with the same terms, the private group appears in search results, and when I click into the group, I can see that there is a connected site.
I am wondering whether, because this instance was set up as a group/site, the connection to a group is obscuring the site in site searches. In other words, group/site sites do not seem to come up in site searches since they are connected to a group.
Professors may set up groups only for emailing, and they may request that students find and join the site. If group/site sites are not coming up in site searches, we may want to fix this in our overhaul of the search functionality.
Updated by Boone Gorges about 5 years ago
Laurie, the issue in this case is that the site is not, in fact, fully public; it's set to "discourage search engines...". As such, it's not listed in the Sites directory, and won't turn up in searches. This point came up recently in https://redmine.gc.cuny.edu/issues/11201#note-7. IMO this is less a question about search and more about what shows up in directories, as it's perhaps not surprising that things hidden from directories would not turn up in directory search.
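A rough sketch of the behavior described above, in Python for illustration only. It assumes the setting is stored like WordPress's `blog_public` option (1 for public, 0 for "discourage search engines"); the Commons may use additional values, and `Site`, `directory_search`, and the "Spanish Club" entry are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Assumption: mirrors WordPress's blog_public option, where 1 means fully public
# and 0 means "discourage search engines". Other values may exist on the Commons.
PUBLIC = 1


@dataclass
class Site:
    title: str
    blog_public: int


def directory_search(sites: List[Site], term: str) -> List[Site]:
    """Return directory-listed sites whose title contains the search term."""
    needle = term.lower()
    # Sites hidden from the directory never reach the title match at all.
    return [s for s in sites if s.blog_public == PUBLIC and needle in s.title.lower()]


sites = [
    Site("The Global Spanish-Speaking Community (SPAN 2204)", blog_public=0),
    Site("Spanish Club", blog_public=1),  # hypothetical second site
]
results = directory_search(sites, "spanish")
```

Under these assumptions the SPAN 2204 site is excluded before any matching happens, which is why no choice of search terms would surface it.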
Updated by Laurie Hurson about 5 years ago
Hi Boone,
Thanks for this info. Sorry to bring up repeat issues in another ticket. I didn't realize that "discourage search engines..." meant sites were totally hidden from the Commons internal directory. I assumed this was a "public but not so public" option that meant that sites would be less likely to be crawled by the likes of Google but would still be searchable on the Commons itself. I am not sure the language of this setting makes it clear enough that the site is actually unsearchable anywhere (across Google and the Commons), so we may want to consider changing the language. Just a suggestion.
Updated by Boone Gorges over 4 years ago
- Target version changed from 1.17.0 to Future release
Boone Gorges wrote:
I'm working on a similar project for the City Tech OpenLab. Let's take time to assess that project sometime this fall, and consider how much of it is applicable for the Commons. (There are many complications related to privacy, result ordering and filtering, etc.)
We did a bunch of research for this on the OpenLab. I was able to build a proof-of-concept integration for Groups that worked pretty well. But the requirement of an external search appliance is a serious one. I was able to make things work with an EC2 instance, but it cost $100+ per month just for the testing instance. So it'd only be feasible if we can host at the GC. This means a discussion with GC IT about how easily they can stand up and maintain an Elasticsearch instance for us. I know very little about how such a setup is maintained, so I'd need to learn and also lean heavily on Lihua's learning. I don't have a sense of how hard this would be. We'd also need to sink time into turning my proof-of-concept into something fully usable, which would be a pretty large amount of work - the privacy settings on the Commons in particular are complex in the context of search.
This could be the kind of project we tackle if we get an injection of external funding, such as through OER grants. It's "shovel ready", it's easy to tell a story about its impact, and it's large enough that it'd be a big drain if we tried to do it within our regular budget.
Updated by Chris Stein over 4 years ago
Boone, is this something that could be built in a way that it could apply to both the Commons and CBOX/CBOX-OL? If so maybe we can look into combining funding from a few sources.
Updated by Boone Gorges over 4 years ago
Chris Stein wrote:
Boone, is this something that could be built in a way that it could apply to both the Commons and CBOX/CBOX-OL? If so maybe we can look into combining funding from a few sources.
Yes, in part. However, writing something that's shareable with CBOX makes the initial task more complex, which would offset some of the benefits of sharing costs. Definitely something to think about, though.