Support #18018
closed
archive their capstone project - problems
0%
Description
Matt has asked me to amplify this ticket.
Stephen Klein (GC library) writes:
I am attempting to help a student archive their capstone project:
https://bisexuality.commons.gc.cuny.edu
and both Conifer (Web Recorder) and HTTRACK are not allowing us to capture her site. The site keeps on redirecting to:
https://commons.gc.cuny.edu/wp-login.php?p=-1&redirect_to=https%3A%2F%2Fbisexuality.commons.gc.cuny.edu
Within Conifer, even after logging in, we are blocked.
HTTRACK presents these errors:
HTTrack3.49-2+htsswf+htsjava launched on Wed, 12 Apr 2023 16:54:19 at https://bisexuality.commons.gc.cuny.edu/ +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
(winhttrack -qwC2%Ps2u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->" -%l "en, *" https://bisexuality.commons.gc.cuny.edu/ -O1 "C:\Users\sklein\Desktop\Test" +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive information,
such as username/password authentication for websites mirrored in this project
do not share these files/folders if you want these information to remain private
16:54:20 Warning: Found for https://bisexuality.commons.gc.cuny.edu/robots.txt
16:54:20 Warning: File has moved from https://bisexuality.commons.gc.cuny.edu/robots.txt to https://bisexuality.commons.gc.cuny.edu/wp-login.php?redirect_to=https%3A%2F%2Fbisexuality.commons.gc.cuny.edu%2Frobots.txt&reauth=1
16:54:20 Warning: Found for https://bisexuality.commons.gc.cuny.edu/
16:54:20 Warning: File has moved from https://bisexuality.commons.gc.cuny.edu/ to https://bisexuality.commons.gc.cuny.edu/wp-login.php?redirect_to=https%3A%2F%2Fbisexuality.commons.gc.cuny.edu%2F&reauth=1
16:54:21 Warning: No data seems to have been transferred during this session! : restoring previous one!
This is a new outcome, so something must have changed on the Commons side.
This is definitely general to the Commons: I tested a few Commons sites and the same result occurred.
They need to deposit by the 14th.
Best,
Stephen
Updated by Boone Gorges over 1 year ago
- Status changed from New to Reporter Feedback
The site https://bisexuality.commons.gc.cuny.edu is configured so that it can only be viewed by members of the Commons. Third-party scraping tools like Conifer can only work on public content. The user will have to make the site publicly visible, at least temporarily, under Dashboard > Settings > Reading.
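One quick way to confirm that a crawl is hitting the members-only wall is to look at where the redirect in the log points. The following is a minimal sketch, not anything the Commons or HTTrack provides: the helper name `login_redirect_target` is made up for illustration, and it only inspects a redirect URL such as the ones in the HTTrack warnings above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def login_redirect_target(location: str) -> Optional[str]:
    """Return the page a wp-login.php redirect points back to.

    Returns None when the URL is not a WordPress login redirect, and an
    empty string when it is one but carries no redirect_to parameter.
    """
    parts = urlsplit(location)
    if not parts.path.endswith("/wp-login.php"):
        return None
    # parse_qs percent-decodes the value, so %3A%2F%2F becomes ://
    targets = parse_qs(parts.query).get("redirect_to", [])
    return targets[0] if targets else ""


# Redirect URL copied from the HTTrack log in this ticket.
logged = (
    "https://bisexuality.commons.gc.cuny.edu/wp-login.php"
    "?redirect_to=https%3A%2F%2Fbisexuality.commons.gc.cuny.edu%2F&reauth=1"
)
print(login_redirect_target(logged))
```

If every requested page, including robots.txt, comes back as a login redirect like this, the crawler is being treated as a logged-out visitor and the site's visibility setting is the likely cause.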
Updated by Raymond Hoh over 1 year ago
It's possible to pass an authenticated cookie to HTTrack, which should allow you to scrape a private Commons site: https://stackoverflow.com/a/58354077.
The cookies you will be interested in can be found on this documentation page: https://developer.wordpress.org/advanced-administration/wordpress/cookies/#users-cookie
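As a rough sketch of that cookie hand-off: as far as I know, HTTrack reads a Netscape-format `cookies.txt` placed in the project folder, so the file can be generated with a few lines of Python. Everything specific below is an assumption for illustration: the helper names, the cookie domain `.commons.gc.cuny.edu`, the `HASH` suffix, the value, and the expiry timestamp are placeholders to be replaced with what your browser shows after logging in.

```python
def netscape_cookie_line(domain, name, value, expires, path="/", secure=True):
    """Format one cookie as a tab-separated Netscape cookies.txt record."""
    # A leading dot on the domain means the cookie also applies to subdomains.
    include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    secure_flag = "TRUE" if secure else "FALSE"
    fields = [domain, include_subdomains, path, secure_flag,
              str(int(expires)), name, value]
    return "\t".join(fields)


def build_cookies_txt(lines):
    """Join cookie records under the conventional Netscape header line."""
    return "# Netscape HTTP Cookie File\n" + "\n".join(lines) + "\n"


# Hypothetical placeholders: copy the real name and value from the browser.
cookie_line = netscape_cookie_line(
    domain=".commons.gc.cuny.edu",      # assumed cookie domain
    name="wordpress_logged_in_HASH",    # HASH stands in for the real suffix
    value="PASTE_COOKIE_VALUE_HERE",
    expires=1700000000,                 # Unix timestamp, placeholder
)
cookies_txt = build_cookies_txt([cookie_line])
print(cookies_txt)
```

The resulting text would be saved as `cookies.txt` in the mirror's output folder before starting the crawl. Like `hts-log.txt`, that file then contains a live login credential and should not be shared or deposited with the archive.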
Specifically, I think the only cookie you will need is the one prefixed with "wordpress_logged_in_". If you need help obtaining your WordPress cookie, let me know.
Updated by Marilyn Weber over 1 year ago
Solved!
It was a no robots/no index issue. Both Conifer and HTTrack historically ignored "do not index" directives, but something changed in WordPress land and those directives now seem to be more robust at preventing crawls. This was a good exercise, and I am glad that it occurred before visiting the Praxis class next week. I will let Marilyn know that they can close the ticket; sorry for the inconvenience of a false alarm.
We continued to see intermittent errors, but I shared with Lacy that it is up to them whether they want to archive a representative sample or attempt to be comprehensive.
All is good.
Thank you,
Stephen
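For anyone hitting the same no robots/no index behavior, a crawler's view of a robots.txt file can be checked offline with Python's standard `urllib.robotparser`. This is only a sketch: the two robots.txt bodies below are hypothetical examples, not the actual file served by the Commons, and `crawler_allowed` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt bodies, for illustration only.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"
SITE = "https://bisexuality.commons.gc.cuny.edu/"

print(crawler_allowed(BLOCK_ALL, "HTTrack", SITE))
print(crawler_allowed(ALLOW_ALL, "HTTrack", SITE))
```

HTTrack's documented `-sN` option controls whether it obeys robots.txt and meta robots tags (`-s0` means never), and the `s2` in the option string logged above appears to correspond to "always". Treat that reading of the log as an assumption and check it against the HTTrack options reference before relying on it.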
Updated by Boone Gorges over 1 year ago
- Status changed from Reporter Feedback to Resolved
- Target version set to Not tracked
Excellent! Thanks, all.