Support #17010

closed

robots.txt

Added by Marilyn Weber about 2 years ago. Updated 3 months ago.

Status: Resolved
Priority name: Normal
Assignee:
Category name: WordPress (misc)
Target version:
Start date: 2022-10-12
Due date:
% Done: 0%
Estimated time:
Deployment actions:

Description

Wole Oyekoya writes:

"I have been wondering why my publications (preprints) are not showing on Google Scholar and I just realized that the default robots.txt disallows pdf files. See https://virtualself.commons.gc.cuny.edu/robots.txt. I’d appreciate if you let me know how to edit the robots.txt. Thanks!"


Related issues

Related to CUNY Academic Commons - Bug #20860: is_home() issue in buddypress-docs linked to sitemap mods (Resolved, Raymond Hoh, 2024-08-30)

Actions #1

Updated by Boone Gorges about 2 years ago

Our robots.txt file is shared across all sites on the Commons.

I'm not sure why we block pdf, doc, and docx files. This must be something that we added ourselves. Unfortunately, we don't track robots.txt in our repo for some reason. The last-touched date of the file is Oct 10 2018, but looking back at my email, it looks like this is when I added the Crawl-delay directive, which doesn't help explain why we exclude the other document types. I've paged back through many years of emails with the word 'robots' and was able to find that the robots.txt file did not have these directives in 2014, but this is not super helpful :)

Ray, do you have any recollection of why we did this? Is there danger if we remove the directives? My initial thought was privacy but it seems to me that our various privacy settings (eg automated htaccess files for non-public sites; serving bp-group-documents via PHP; etc) should prevent problems like this.

Actions #2

Updated by Raymond Hoh about 2 years ago

> Ray, do you have any recollection of why we did this?

Unfortunately, I do not.

> Is there danger if we remove the directives?

I don't think so. I think we should move our robots.txt customizations to the 'robots_txt' filter, or to the 'do_robots' action at a higher priority, and remove the /robots.txt file from the production server. That way, other plugins that use these hooks, such as Better WordPress Google XML Sitemaps, Yoast SEO and Jetpack, could make their own modifications to robots.txt.

If we decide to keep a static robots.txt file for performance reasons, then we can just remove the file-type directives from that file.

Actions #3

Updated by Boone Gorges about 2 years ago

  • Status changed from New to Resolved
  • Target version set to 2.0.10

Thanks, Ray. I agree with your assessment.

In https://github.com/cuny-academic-commons/cac/commit/088ae4270d56dafdaf4ff7291c8aafd4db8c99aa I made the following changes:
- Moved our robots.txt generation to the robots_txt filter
- Removed the file-type blocks discussed here (pdf, doc, docx)

I deployed this as a hotfix and moved the existing robots.txt file to robots.txt.bak, which we can reference in the upcoming weeks if we notice any issues.
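
For readers unfamiliar with the hook, moving robots.txt generation to the robots_txt filter might look roughly like the sketch below. The function name and the Crawl-delay value are assumptions made for illustration; the actual change is the one in the commit linked above.

```php
<?php
/**
 * Hypothetical callback for WordPress's 'robots_txt' filter.
 *
 * @param string $output Robots.txt content generated by WordPress so far.
 * @param mixed  $public Value of the site's 'blog_public' option.
 * @return string Possibly modified robots.txt content.
 */
function example_cac_robots_txt( $output, $public ) {
    // Sites that discourage search engines keep WordPress's default output.
    if ( ! $public ) {
        return $output;
    }

    // Illustrative shared directive; the delay value is an assumption.
    // Note that no file-type rules (pdf, doc, docx) are added here.
    $output .= "Crawl-delay: 10\n";

    return $output;
}

// Register the callback only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'robots_txt', 'example_cac_robots_txt', 10, 2 );
}
```

Because the output passes through a filter, other plugins hooked to robots_txt can still append their own rules. WordPress only generates this virtual robots.txt when no physical robots.txt file is served from the web root, which is why the static file has to be moved out of the way.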

Actions #4

Updated by Marilyn Weber about 2 years ago

Thanks. Does this mean Wole can now edit? He is asking how to do that.

Actions #5

Updated by Boone Gorges about 2 years ago

No, users cannot edit their robots.txt. However, the .pdf restriction has been lifted sitewide, so Google should now be able to crawl these files.

Actions #6

Updated by Marilyn Weber about 2 years ago

Thanks, I let him know.

Actions #7

Updated by Raymond Hoh about 2 years ago

Boone, I noticed that the generated wp-sitemap.xml file for the main site includes a lot of post types and taxonomies that may not be applicable such as:

For post types, we can use the 'wp_sitemaps_post_types' filter to omit certain post types from wp-sitemap.xml:

add_filter( 'wp_sitemaps_post_types', function( $retval ) {
    unset( $retval['cac_course'] );
    return $retval;
} );

For taxonomies, some of the taxonomies can be addressed by setting the 'publicly_queryable' property to false. If the taxonomies need to be publicly queryable, we can use the 'wp_sitemaps_taxonomies' filter similar to the 'wp_sitemaps_post_types' filter above.
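
As a rough sketch of the taxonomy variant mentioned above, the 'wp_sitemaps_taxonomies' filter receives an array of taxonomy objects keyed by taxonomy name, so entries can be unset in the same way. The function name is hypothetical, and the two slugs are simply taxonomies named elsewhere in this ticket, used here as examples.

```php
<?php
/**
 * Hypothetical callback: drop internal taxonomies from wp-sitemap.xml.
 *
 * @param array $taxonomies Taxonomy objects keyed by taxonomy name.
 * @return array Filtered list of taxonomies.
 */
function example_cac_sitemap_taxonomies( $taxonomies ) {
    // Example slugs only; adjust to the taxonomies that should stay private.
    $hidden = array( 'topic-tag', 'cacap_position_department' );

    foreach ( $hidden as $slug ) {
        unset( $taxonomies[ $slug ] );
    }

    return $taxonomies;
}

// Register the callback only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'wp_sitemaps_taxonomies', 'example_cac_sitemap_taxonomies' );
}
```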

Also, we should probably remove the Users sitemap from the main site because it references /author/USERNAME/:

Or we could keep it if we change the permalink from /author/USERNAME/ to /members/USERNAME/ so it references the BuddyPress profile URL instead with the following:

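// Replace the author archive URL with the BuddyPress profile URL.
add_filter( 'wp_sitemaps_users_entry', function( $retval ) {
    // $retval is the sitemap entry array; its 'loc' key holds the URL.
    if ( is_main_site() && isset( $retval['loc'] ) ) {
        $retval['loc'] = str_replace( '/author/', '/members/', $retval['loc'] );
    }

    return $retval;
} );

Actions #8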

Updated by Boone Gorges about 2 years ago

  • Status changed from Resolved to Assigned
  • Assignee set to Raymond Hoh

These are all good catches, Ray.

For posterity, here's some reasoning:

- Hide https://commons.gc.cuny.edu/wp-sitemap-posts-cacsp_paper-1.xml because we no longer publicize the existence of Papers
- Hide the cac_course post types because they link to individual course URLs that don't represent anything (courses themselves are Sites, or Groups, or both)
- Hide https://commons.gc.cuny.edu/wp-sitemap-posts-tapor_tool-1.xml because we're not advertising TaPOR tools.
- Hide the taxonomies because they're needed for internal use, and are not meant to be publicly browseable in this way.

Looking over these taxonomies, I don't think any of them need to be publicly queryable. (IIRC the technical meaning of publicly_queryable is: can you access a taxonomy archive at a public URL - is that all it does?)

> Or we could keep it if we change the permalink from /author/USERNAME/ to /members/USERNAME/ so it references the BuddyPress profile URL instead with the following:

I really like this idea (rather than hiding the directory).

Ray, could you go ahead and implement these changes?

Actions #9

Updated by Raymond Hoh about 2 years ago

> Ray, could you go ahead and implement these changes?

Done in https://github.com/cuny-academic-commons/cac/compare/710bf1d...eaef35d984.

> Looking over these taxonomies, I don't think any of them need to be publicly queryable. (IIRC the technical meaning of publicly_queryable is: can you access a taxonomy archive at a public URL - is that all it does?)

Yeah, I believe you're right. After reading the PHPDoc, I went one step further and switched from 'publicly_queryable' => false to 'public' => false since 'public' defaults to true if not set: https://github.com/WordPress/WordPress/blob/a921157f3a1fff5cd403612d7b5201043917860e/wp-includes/class-wp-taxonomy.php#L343. And the 'public' value gets passed down to 'publicly_queryable', 'show_ui' and 'show_in_nav_menus' if not set.
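
A minimal sketch of that registration pattern follows. The taxonomy name, the attached post type, and the decision to keep the admin UI are all assumptions for illustration and are not taken from the linked commits.

```php
<?php
/**
 * Hypothetical arguments for an internal-only taxonomy.
 *
 * @return array Arguments for register_taxonomy().
 */
function example_internal_taxonomy_args() {
    return array(
        // With 'public' false, 'publicly_queryable' and 'show_in_nav_menus'
        // inherit false because they are left unset here.
        'public'       => false,
        // Assumption: the admin UI is still wanted, so override the inherited value.
        'show_ui'      => true,
        'hierarchical' => false,
    );
}

/**
 * Registers the example taxonomy against the 'post' post type.
 */
function example_register_internal_taxonomy() {
    register_taxonomy( 'example_internal_tax', 'post', example_internal_taxonomy_args() );
}

// Register on 'init' only when WordPress is loaded.
if ( function_exists( 'add_action' ) ) {
    add_action( 'init', 'example_register_internal_taxonomy' );
}
```

Since the core sitemap only lists public taxonomies, a taxonomy registered this way should drop out of wp-sitemap.xml without needing a separate sitemap filter.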

For CACAP-CAC, I've set those taxonomies to 'public' => false in https://github.com/cuny-academic-commons/cac/commit/15de62be8528332fdf9cf43b76b930ee5e56c57f.

For BuddyPress Docs, I've opened a PR: https://github.com/boonebgorges/buddypress-docs/pull/717. For CAC Group Library, I've also opened a PR: https://github.com/cuny-academic-commons/cac-group-library/pull/15. Until those PRs are merged, I've removed them from the sitemap temporarily in https://github.com/cuny-academic-commons/cac/compare/cb3539c...eaef35d984.

Actions #10

Updated by Boone Gorges about 2 years ago

  • Target version changed from 2.0.10 to 2.0.11
Actions #11

Updated by Boone Gorges about 2 years ago

  • Target version changed from 2.0.11 to 2.0.12
Actions #12

Updated by Boone Gorges almost 2 years ago

  • Target version changed from 2.0.12 to 2.0.13
Actions #13

Updated by Boone Gorges almost 2 years ago

  • Target version changed from 2.0.13 to 2.0.14
Actions #14

Updated by Boone Gorges almost 2 years ago

  • Target version changed from 2.0.14 to 2.0.15
Actions #15

Updated by Boone Gorges almost 2 years ago

  • Target version changed from 2.0.15 to 2.1.1
Actions #16

Updated by Boone Gorges almost 2 years ago

  • Target version changed from 2.1.1 to 2.1.2
Actions #17

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.2 to 2.1.3
Actions #18

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.3 to 2.1.4
Actions #19

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.4 to 2.1.5
Actions #20

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.5 to 2.1.6
Actions #21

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.6 to 2.1.7
Actions #22

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.7 to 2.1.8
Actions #23

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.8 to 2.1.9
Actions #24

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.9 to 2.1.10
Actions #25

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.10 to 2.1.11
Actions #26

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.11 to 2.1.12
Actions #27

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.12 to 2.1.13
Actions #28

Updated by Boone Gorges over 1 year ago

  • Target version changed from 2.1.13 to 2.1.14
Actions #29

Updated by Boone Gorges about 1 year ago

  • Target version changed from 2.1.14 to 2.1.15
Actions #30

Updated by Boone Gorges about 1 year ago

  • Target version changed from 2.1.15 to 2.1.16
Actions #31

Updated by Boone Gorges about 1 year ago

  • Target version changed from 2.1.16 to 2.2.1
Actions #32

Updated by Boone Gorges about 1 year ago

  • Target version changed from 2.2.1 to 2.2.2
Actions #33

Updated by Boone Gorges about 1 year ago

  • Target version changed from 2.2.2 to 2.2.3
Actions #34

Updated by Boone Gorges 12 months ago

  • Target version changed from 2.2.3 to 2.2.4
Actions #35

Updated by Boone Gorges 12 months ago

  • Target version changed from 2.2.4 to 2.2.5
Actions #36

Updated by Boone Gorges 11 months ago

  • Target version changed from 2.2.5 to 2.2.6
Actions #37

Updated by Boone Gorges 11 months ago

  • Target version changed from 2.2.6 to 2.3.1
Actions #38

Updated by Boone Gorges 10 months ago

  • Target version changed from 2.3.1 to 2.3.2
Actions #39

Updated by Boone Gorges 9 months ago

  • Target version changed from 2.3.2 to 2.3.3
Actions #40

Updated by Boone Gorges 9 months ago

  • Target version changed from 2.3.3 to 2.3.4
Actions #41

Updated by Boone Gorges 8 months ago

  • Target version changed from 2.3.4 to 2.3.5
Actions #42

Updated by Boone Gorges 8 months ago

  • Target version changed from 2.3.5 to 2.3.6
Actions #43

Updated by Boone Gorges 7 months ago

  • Target version changed from 2.3.6 to 2.3.7
Actions #44

Updated by Boone Gorges 6 months ago

  • Target version changed from 2.3.7 to 2.3.8
Actions #45

Updated by Boone Gorges 6 months ago

  • Target version changed from 2.3.8 to 589
Actions #46

Updated by Boone Gorges 5 months ago

  • Target version changed from 589 to 2.4.1
Actions #47

Updated by Boone Gorges 5 months ago

  • Target version changed from 2.4.1 to 2.4.2
Actions #48

Updated by Boone Gorges 5 months ago

  • Target version changed from 2.4.2 to 2.4.3
Actions #49

Updated by Boone Gorges 4 months ago

  • Target version changed from 2.4.3 to 2.4.4
Actions #50

Updated by Boone Gorges 3 months ago

  • Target version changed from 2.4.4 to 2.4.5
Actions #51

Updated by Raymond Hoh 3 months ago

Some updates. I've:

- Removed the temporary 'group-documents-category' taxonomy fix from the sitemap in https://github.com/cuny-academic-commons/cac/commit/078ffddbab575d4693ade88dc4d95439f98be628
- Removed the 'cacap_position_department' taxonomy from the sitemap in https://github.com/cuny-academic-commons/cac/commit/d38237d4a7e9337e7c25cbbc66a150531255891d
- Removed the 'cac-cv' post type from the sitemap because it is redundant with the users sitemap - https://github.com/cuny-academic-commons/cac/commit/a20ddacde6aea977a79b113fec4ac394cffb27c9

While auditing the sitemap, I noticed that our group docs were not resolving correctly. This should now be addressed in #20787.

Actions #52

Updated by Raymond Hoh 3 months ago

  • Category name set to WordPress (misc)
  • Status changed from Assigned to Staged for Production Release

Some more updates. I've:
- Set the forum post type in the sitemap to only display forums if there is at least one topic in https://github.com/cuny-academic-commons/cac/commit/13a6154f2134a295cfd13d512f3e5b2a9da652a0
- Disabled the reply post type from the sitemap in https://github.com/cuny-academic-commons/cac/commit/7f0289887729d21dcf37281dbdfc2bf3741c4897 . bbPress generates custom reply permalinks like /group/MY-GROUP/forum/reply/POST-ID/, but when navigating to this type of link, no content is displayed in the body. Even though this is a bbPress bug, I think it is not necessary to have a 'reply' sitemap since the 'topic' sitemap is more useful.
- Disabled the topic-tag taxonomy from the sitemap in https://github.com/cuny-academic-commons/cac/commit/6d4f94cac05c3037d9d07cd2eff873719a47b922 . This is used by bbPress. We decided during the group redesign to remove the topic tags feature, so it makes sense to remove forum topic tags from the sitemap as well.
- Removed duplicate recurring events from the sitemap in https://github.com/cuny-academic-commons/cac/commit/3235bfa32198530eb05939ea0dce851f2051627b . Recurring events should only display once in the sitemap. This is an issue with the Event Organiser plugin; I posted an issue to the plugin author about this here: https://github.com/stephenharris/Event-Organiser/issues/547 .
- Found an issue where some private BuddyPress Docs were being listed in the sitemap - https://commons.gc.cuny.edu/wp-sitemap-posts-bp_doc-1.xml. I traced this to a performance tweak we made to disable the DB query for BuddyPress Docs access only for the front page in https://github.com/cuny-academic-commons/cac/commit/641da630662e971ede7328cef0b2b5e321e8c205 . The problem is the sitemap page returns true for the is_front_page() check, so I added a check to see if we're on a sitemap page in https://github.com/cuny-academic-commons/cac/commit/881caa13e1fec7ed291e7c00c691148e15e41965 so the access query for BuddyPress Docs is reinstated.
- Removed some more temporary BuddyPress Docs taxonomy fixes from the sitemap in https://github.com/cuny-academic-commons/cac/commit/846988781dd62bdbfbb33c2a5f30c8a67244a9d9
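
One way the forum restriction in the first item could be expressed is through the 'wp_sitemaps_posts_query_args' filter, sketched below. This is not necessarily how the linked commit does it: the function name is hypothetical, and the sketch assumes bbPress's default 'forum' post type and that the topic count is stored in the '_bbp_topic_count' post meta key.

```php
<?php
/**
 * Hypothetical callback: only list forums that have at least one topic.
 *
 * @param array  $args      WP_Query arguments used to build the sitemap.
 * @param string $post_type Post type the sitemap is being built for.
 * @return array Possibly modified query arguments.
 */
function example_forum_sitemap_query_args( $args, $post_type ) {
    // Assumption: bbPress forums use the default 'forum' post type.
    if ( 'forum' !== $post_type ) {
        return $args;
    }

    $meta_query = ( isset( $args['meta_query'] ) && is_array( $args['meta_query'] ) )
        ? $args['meta_query']
        : array();

    // Assumption: bbPress keeps the topic count in '_bbp_topic_count'.
    $meta_query[] = array(
        'key'     => '_bbp_topic_count',
        'value'   => 0,
        'compare' => '>',
        'type'    => 'NUMERIC',
    );

    $args['meta_query'] = $meta_query;

    return $args;
}

// Register the callback only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'wp_sitemaps_posts_query_args', 'example_forum_sitemap_query_args', 10, 2 );
}
```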

I also found an issue with event venue pages not being styled correctly. I'll post a new ticket about this. Apart from that, I've done what I wanted to do here.
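
The BuddyPress Docs fix in the list above comes down to not trusting is_front_page() on its own. A minimal sketch of that guard is below; the function names are hypothetical, and it assumes the core sitemap request can be recognized by a non-empty 'sitemap' query var, which may differ from what the linked commit checks.

```php
<?php
/**
 * Hypothetical check: does the 'sitemap' query var indicate a sitemap request?
 *
 * @param mixed $sitemap_query_var Value of get_query_var( 'sitemap' ).
 * @return bool
 */
function example_is_sitemap_request( $sitemap_query_var ) {
    return '' !== (string) $sitemap_query_var;
}

/**
 * Hypothetical guard: skip the Docs access query only on a real front page.
 *
 * @param bool  $is_front_page     Result of is_front_page().
 * @param mixed $sitemap_query_var Value of get_query_var( 'sitemap' ).
 * @return bool True when the access query can safely be skipped.
 */
function example_should_skip_docs_access_query( $is_front_page, $sitemap_query_var ) {
    // Sitemap requests can report as the front page, so exclude them explicitly.
    return $is_front_page && ! example_is_sitemap_request( $sitemap_query_var );
}
```

Inside WordPress this would be called as example_should_skip_docs_access_query( is_front_page(), get_query_var( 'sitemap' ) ), so that the access query still runs, and private docs stay filtered out, whenever a sitemap is being generated.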

Actions #53

Updated by Boone Gorges 3 months ago

  • Status changed from Staged for Production Release to Resolved
Actions #54

Updated by Boone Gorges 3 months ago

  • Related to Bug #20860: is_home() issue in buddypress-docs linked to sitemap mods added