Support #18018

closed

archive their capstone project - problems

Added by Marilyn Weber about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority name: High
Assignee: -
Category name: -
Target version:
Start date: 2023-04-12
Due date:
% Done: 0%
Estimated time:
Deployment actions:

Description

Matt has asked me to amplify this ticket.

Stephen Klein (GC library) writes
I am attempting to help a student archive their capstone project:
https://bisexuality.commons.gc.cuny.edu

and both Conifer (Web Recorder) and HTTRACK are not allowing us to capture her site. The site keeps on redirecting to:
https://commons.gc.cuny.edu/wp-login.php?p=-1&redirect_to=https%3A%2F%2Fbisexuality.commons.gc.cuny.edu
Within Conifer, even after logging in, we are blocked.

HTTRACK presents these errors:
HTTrack3.49-2+htsswf+htsjava launched on Wed, 12 Apr 2023 16:54:19 at https://bisexuality.commons.gc.cuny.edu/ +.png +.gif +.jpg +.jpeg +.css +.js ad.doubleclick.net/* -mime:application/foobar
(winhttrack -qwC2%Ps2u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-
Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->" -%l "en, " https://bisexuality.commons.gc.cuny.edu/ -O1 "C:\Users\sklein\Desktop\Test" +.png +.gif +.jpg +.jpeg +.css +.js -ad.doubleclick.net/ -mime:application/foobar )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive information,
such as username/password authentication for websites mirrored in this project
do not share these files/folders if you want these information to remain private
16:54:20 Warning: Found for https://bisexuality.commons.gc.cuny.edu/robots.txt
16:54:20 Warning: File has moved from https://bisexuality.commons.gc.cuny.edu/robots.txt to https://bisexuality.commons.gc.cuny.edu/wp-login.php?redirect_to=https%3A%2F%2Fbisexuality.commons.gc.cuny.edu%2Frobots.txt&reauth=1
16:54:20 Warning: Found for https://bisexuality.commons.gc.cuny.edu/
16:54:20 Warning: File has moved from https://bisexuality.commons.gc.cuny.edu/ to https://bisexuality.commons.gc.cuny.edu/wp-login.php?redirect_to=https%3A%2F%2Fbisexuality.commons.gc.cuny.edu%2F&reauth=1
16:54:21 Warning: No data seems to have been transferred during this session! : restoring previous one!

This is a new outcome, so something must have changed on the Commons side.
This is definitely general to the Commons: I tested a few Commons sites and the same result occurred.

They need to deposit by the 14th.

Best,
Stephen

Actions #1

Updated by Boone Gorges about 1 year ago

  • Status changed from New to Reporter Feedback

The site https://bisexuality.commons.gc.cuny.edu is configured so that it can only be viewed by members of the Commons. Third-party scraping tools like Conifer can only work on public content. The user will have to make the site publicly visible, at least temporarily. Dashboard > Settings > Reading.
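For anyone checking other sites for the same symptom, here is a rough sketch (not part of the Commons codebase) of how to tell from a crawler log whether a "File has moved" target is just the WordPress login screen. The function name `login_redirect_target` is made up for illustration; the URL is copied from the HTTrack log in the description.

```python
from urllib.parse import urlsplit, parse_qs


def login_redirect_target(location):
    """Return the originally requested URL if `location` points at
    wp-login.php with a redirect_to parameter, otherwise None.

    Hypothetical helper for reading crawler logs; not a WordPress API.
    """
    parts = urlsplit(location)
    if not parts.path.endswith("/wp-login.php"):
        return None
    # parse_qs also percent-decodes the value (%3A%2F%2F becomes ://).
    targets = parse_qs(parts.query).get("redirect_to")
    return targets[0] if targets else None


# Redirect that HTTrack reported for the site root.
moved_to = (
    "https://bisexuality.commons.gc.cuny.edu/wp-login.php"
    "?redirect_to=https%3A%2F%2Fbisexuality.commons.gc.cuny.edu%2F&reauth=1"
)
print(login_redirect_target(moved_to))
```

If this returns a URL for the site root and for robots.txt alike, the crawler is most likely being treated as a logged-out visitor on a members-only site rather than hitting a server error.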

Actions #2

Updated by Raymond Hoh about 1 year ago

It's possible to pass an authenticated cookie to HTTrack, which should allow you to scrape a private Commons site: https://stackoverflow.com/a/58354077.

The cookies you will be interested in can be found on this documentation page: https://developer.wordpress.org/advanced-administration/wordpress/cookies/#users-cookie

Specifically, I think the only cookie you will need is the one prefixed with "wordpress_logged_in_". If you need help to obtain your WordPress cookie, let me know.
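As a starting point, here is a minimal sketch of building a Netscape-format cookie file, on the assumption that HTTrack picks up a `cookies.txt` in that format from the project folder (I have not verified this against the current HTTrack release). The helper names, the cookie domain, the hash suffix in the cookie name, the value, and the expiry timestamp below are all placeholders; copy the real name and value from your browser's developer tools after logging in.

```python
def netscape_cookie_line(domain, name, value, expires, path="/", secure=True):
    """Format one cookie as a tab-separated Netscape cookie-file line.

    Hypothetical helper; the field order is domain, subdomain flag, path,
    secure flag, expiry (Unix time), name, value.
    """
    include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    fields = [
        domain,
        include_subdomains,
        path,
        "TRUE" if secure else "FALSE",
        str(int(expires)),
        name,
        value,
    ]
    return "\t".join(fields)


def write_cookie_file(filename, lines):
    """Write the cookie lines under the conventional Netscape header."""
    with open(filename, "w", encoding="utf-8") as fh:
        fh.write("# Netscape HTTP Cookie File\n")
        for line in lines:
            fh.write(line + "\n")


# Placeholder values only: replace the name suffix, value, and expiry
# with what your browser shows for the logged-in session.
example = netscape_cookie_line(
    domain=".commons.gc.cuny.edu",          # assumed cookie domain
    name="wordpress_logged_in_HASH",        # real name ends in a site hash
    value="PASTE_COOKIE_VALUE_HERE",
    expires=1700000000,                     # assumed expiry, Unix time
)
print(example)
```

Calling `write_cookie_file("cookies.txt", [example])` in the HTTrack project folder would produce the file. Treat it like a password: the note in the HTTrack log about not sharing hts-log.txt and hts-cache applies to this file too.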

Actions #3

Updated by Marilyn Weber about 1 year ago

Solved!

It was a no robots/no index issue. Both Conifer and HTTrack historically ignored 'do not index' directives, but something changed in WordPress land, and those directives now seem to be more robust at preventing crawls. This was a good exercise, and I am glad that it occurred before visiting the Praxis class next week. I will let Marilyn know that they can close the ticket; sorry for the inconvenience of a false alarm.

We continued to see intermittent errors, but I shared with Lacy that it is up to them whether they want to archive a representative sample or attempt to be comprehensive.

All is good.

Thank you,
Stephen

Actions #4

Updated by Boone Gorges about 1 year ago

  • Status changed from Reporter Feedback to Resolved
  • Target version set to Not tracked

Excellent! Thanks, all.
