Support #17010

open

robots.txt

Added by Marilyn Weber 4 months ago. Updated 27 days ago.

Status: Assigned
Priority name: Normal
Assignee:
Category name: -
Target version:
Start date: 2022-10-12
Due date:
% Done: 0%
Estimated time:
Deployment actions:

Description

Wole Oyekoya writes

"I have been wondering why my publications (preprints) are not showing on Google Scholar and I just realized that the default robots.txt disallows pdf files. See https://virtualself.commons.gc.cuny.edu/robots.txt. I’d appreciate if you let me know how to edit the robots.txt. Thanks!"

Actions #1

Updated by Boone Gorges 4 months ago

Our robots.txt file is shared across all sites on the Commons.

I'm not sure why we block pdf, doc, and docx files. This must be something that we added ourselves. Unfortunately, we don't track robots.txt in our repo for some reason. The last-touched date of the file is Oct 10 2018, but looking back at my email, it looks like this is when I added the Crawl-delay directive, which doesn't help explain why we exclude the other document types. I've paged back through many years of emails with the word 'robots' and was able to find that the robots.txt file did not have these directives in 2014, but this is not super helpful :)

Ray, do you have any recollection of why we did this? Is there danger if we remove the directives? My initial thought was privacy but it seems to me that our various privacy settings (eg automated htaccess files for non-public sites; serving bp-group-documents via PHP; etc) should prevent problems like this.

Actions #2

Updated by Raymond Hoh 4 months ago

> Ray, do you have any recollection of why we did this?

Unfortunately, I do not.

> Is there danger if we remove the directives?

I don't think so. I think we should move our robots.txt customizations to the 'robots_txt' filter, or to the 'do_robots' action at a higher priority, and remove the /robots.txt file from the production server. That way, plugins that use these hooks, such as Better WordPress Google XML Sitemaps, Yoast SEO, and Jetpack, could also modify robots.txt.

If we decide to keep a static robots.txt file for performance reasons, then we can just remove the file-type directives from that file.
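
For illustration, a minimal sketch of the filter-based approach might look like the following. It is not the code from the actual commit: the function name cac_example_robots_txt and the Crawl-delay value are assumptions, and the default priority of 10 is used only as a placeholder. The 'robots_txt' filter itself is a WordPress core hook that passes the generated output and the site's public flag.

```php
<?php
// Sketch only: the function name and the Crawl-delay value are assumptions,
// not taken from the actual Commons code.

/**
 * Appends network-wide rules to the robots.txt output that WordPress generates.
 *
 * @param string $output Robots.txt content built so far.
 * @param bool   $public Whether the site is visible to search engines.
 * @return string Modified robots.txt content.
 */
function cac_example_robots_txt( $output, $public ) {
    // Leave the output alone on sites that discourage search engines.
    if ( ! $public ) {
        return $output;
    }

    // Hypothetical value; the thread does not state the real delay.
    $output .= "Crawl-delay: 10\n";

    // No Disallow rules for pdf, doc, or docx files are added here.
    return $output;
}

// Register the callback only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'robots_txt', 'cac_example_robots_txt', 10, 2 );
}
```

Because the rules are attached through a hook rather than a static file, other plugins that filter 'robots_txt' still get a chance to add their own lines.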

Actions #3

Updated by Boone Gorges 4 months ago

  • Status changed from New to Resolved
  • Target version set to 2.0.10

Thanks, Ray. I agree with your assessment.

In https://github.com/cuny-academic-commons/cac/commit/088ae4270d56dafdaf4ff7291c8aafd4db8c99aa I made the following changes:
- Moved our robots.txt generation to the robots_txt filter
- Removed the file-type blocks discussed here (pdf, doc, docx)

I deployed this as a hotfix and moved the existing robots.txt file to robots.txt.bak, which we can reference in the upcoming weeks if we notice any issues.

Actions #4

Updated by Marilyn Weber 4 months ago

Thanks. Does this mean Wole can now edit? He is asking how to do that.

Actions #5

Updated by Boone Gorges 4 months ago

No, users cannot edit their robots.txt. However, the .pdf restriction has been lifted sitewide, so Google should now be able to crawl these files.

Actions #6

Updated by Marilyn Weber 4 months ago

Thanks, I let him know.

Actions #7

Updated by Raymond Hoh 4 months ago

Boone, I noticed that the generated wp-sitemap.xml file for the main site includes a lot of post types and taxonomies that may not be applicable such as:

For post types, we can use the 'wp_sitemaps_post_types' filter to omit certain post types from wp-sitemap.xml:

// Omit the cac_course post type from wp-sitemap.xml.
add_filter( 'wp_sitemaps_post_types', function( $retval ) {
    unset( $retval['cac_course'] );
    return $retval;
} );

For taxonomies, some of the taxonomies can be addressed by setting the 'publicly_queryable' property to false. If the taxonomies need to be publicly queryable, we can use the 'wp_sitemaps_taxonomies' filter similar to the 'wp_sitemaps_post_types' filter above.
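
A rough sketch of the 'wp_sitemaps_taxonomies' variant is below. The filter is a WordPress core hook that receives the taxonomy objects keyed by name; the function name and the two taxonomy slugs are hypothetical placeholders, since the thread does not list the actual slugs.

```php
<?php
// Sketch only: the function name and taxonomy slugs are hypothetical.

/**
 * Removes internal-only taxonomies from wp-sitemap.xml.
 *
 * @param array $taxonomies Taxonomy objects keyed by taxonomy name.
 * @return array Taxonomies that should still appear in the sitemap.
 */
function cac_example_sitemap_taxonomies( $taxonomies ) {
    // Placeholder slugs; substitute the taxonomies that must stay queryable
    // but should not be listed in the sitemap.
    $internal = array( 'example_internal_tax', 'example_other_tax' );

    foreach ( $internal as $slug ) {
        unset( $taxonomies[ $slug ] );
    }

    return $taxonomies;
}

// Register the callback only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'wp_sitemaps_taxonomies', 'cac_example_sitemap_taxonomies' );
}
```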

Also, we should probably remove the Users sitemap from the main site because it references /author/USERNAME/:

Or we could keep it if we change the permalink from /author/USERNAME/ to /members/USERNAME/ so it references the BuddyPress profile URL instead with the following:

// Replace author profile URL with BP profile URL.
add_filter( 'wp_sitemaps_users_entry', function( $retval ) {
    if ( is_main_site() ) {
        $retval = str_replace( '/author/', '/members/', $retval );
    }

    return $retval;
} );

Actions #8

Updated by Boone Gorges 4 months ago

  • Status changed from Resolved to Assigned
  • Assignee set to Raymond Hoh

These are all good catches, Ray.

For posterity, here's some reasoning:

- Hide https://commons.gc.cuny.edu/wp-sitemap-posts-cacsp_paper-1.xml because we no longer publicize the existence of Papers
- Hide the cac_course post types because they link to individual course URLs that don't represent anything (courses themselves are Sites, or Groups, or both)
- Hide https://commons.gc.cuny.edu/wp-sitemap-posts-tapor_tool-1.xml because we're not advertising TaPOR tools.
- Hide the taxonomies because they're needed for internal use, and are not meant to be publicly browseable in this way.

Looking over these taxonomies, I don't think any of them need to be publicly queryable. (IIRC the technical meaning of publicly_queryable is: can you access a taxonomy archive at a public URL - is that all it does?)

> Or we could keep it if we change the permalink from /author/USERNAME/ to /members/USERNAME/ so it references the BuddyPress profile URL instead with the following:

I really like this idea (rather than hiding the directory).

Ray, could you go ahead and implement these changes?

Actions #9

Updated by Raymond Hoh 3 months ago

> Ray, could you go ahead and implement these changes?

Done in https://github.com/cuny-academic-commons/cac/compare/710bf1d...eaef35d984.

> Looking over these taxonomies, I don't think any of them need to be publicly queryable. (IIRC the technical meaning of publicly_queryable is: can you access a taxonomy archive at a public URL - is that all it does?)

Yeah, I believe you're right. After reading the PHPDoc, I went one step further and switched from 'publicly_queryable' => false to 'public' => false since 'public' defaults to true if not set: https://github.com/WordPress/WordPress/blob/a921157f3a1fff5cd403612d7b5201043917860e/wp-includes/class-wp-taxonomy.php#L343. And the 'public' value gets passed down to 'publicly_queryable', 'show_ui' and 'show_in_nav_menus' if not set.
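
As a hedged illustration of that switch, registering an internal-only taxonomy with 'public' => false could look roughly like this. The taxonomy slug, the 'post' object type, and the function names are hypothetical; only register_taxonomy() and the 'init' hook are WordPress core APIs.

```php
<?php
// Sketch only: the taxonomy slug, object type, and function names are hypothetical.

/**
 * Builds the arguments for an internal-only taxonomy.
 *
 * @return array Arguments for register_taxonomy().
 */
function cac_example_internal_taxonomy_args() {
    return array(
        // 'publicly_queryable', 'show_ui' and 'show_in_nav_menus' are left
        // unset on purpose so that they inherit this value.
        'public'       => false,
        'hierarchical' => false,
        'rewrite'      => false,
    );
}

/**
 * Registers the taxonomy once WordPress has initialized.
 */
function cac_example_register_internal_taxonomy() {
    register_taxonomy( 'example_internal_tax', 'post', cac_example_internal_taxonomy_args() );
}

// Register the callback only when WordPress is loaded.
if ( function_exists( 'add_action' ) ) {
    add_action( 'init', 'cac_example_register_internal_taxonomy' );
}
```

One thing to watch with this approach: because 'show_ui' inherits the false value too, a taxonomy that still needs an admin screen would have to set 'show_ui' => true explicitly.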

For CACAP-CAC, I've set those taxonomies to 'public' => false in https://github.com/cuny-academic-commons/cac/commit/15de62be8528332fdf9cf43b76b930ee5e56c57f.

For BuddyPress Docs, I've opened a PR: https://github.com/boonebgorges/buddypress-docs/pull/717. For CAC Group Library, I've also opened a PR: https://github.com/cuny-academic-commons/cac-group-library/pull/15. Until those PRs are merged, I've removed them from the sitemap temporarily in https://github.com/cuny-academic-commons/cac/compare/cb3539c...eaef35d984.

Actions #10

Updated by Boone Gorges 3 months ago

  • Target version changed from 2.0.10 to 2.0.11

Actions #11

Updated by Boone Gorges 3 months ago

  • Target version changed from 2.0.11 to 2.0.12

Actions #12

Updated by Boone Gorges 3 months ago

  • Target version changed from 2.0.12 to 2.0.13

Actions #13

Updated by Boone Gorges about 2 months ago

  • Target version changed from 2.0.13 to 2.0.14

Actions #14

Updated by Boone Gorges about 1 month ago

  • Target version changed from 2.0.14 to 2.0.15

Actions #15

Updated by Boone Gorges 27 days ago

  • Target version changed from 2.0.15 to 2.1.1