Project on many Offline Internet-Archives on the subject of Scientology, Freezone, Critics
- 1 Plan
- 2 Situation
- 3 WayBack-Machine - a semi-solution
- 4 Solution
- 5 The steps of the project
- 5.1 We collect collections of URLs
- 5.2 Entering all Scn-URLs in a table
- 5.3 Consolidation of the data in tbl_URLs
- 5.4 Checkout the software and hardware used
- 5.5 Document our findings on the steps until now for the public
- 5.6 Create a Mailing List for the Scn-Internet-Archive
- 5.7 Regularly download sessions of all the Scn-related-sites
- 5.8 Further project: An online-Scientology-Webarchive
- 6 Critical remarks about this project
- 7 Who is interested to work with me on this?
Mirroring of all websites and newsgroups on the subject of Scientology (including critics, Church sites and independents) on many independent computer systems of the library network, so that these data remain available even if the original publisher takes them offline. This is a necessity for scientific work and to safeguard freedom of speech.
There is no guarantee that anything once published on the internet will stay there forever!
I can see these reasons:
See a current example of this here: Truth revealed: Website suppressed, which showed the truth about the take-over of Scientology by the US-government
If someone violates copyrights, he is eventually forced to take his site down.
Lack of Money
Some people shut down their valuable site just because of a lack of money. What a pity.
Change of Mind
People change their mind. And so their websites change too. But others may regret this, as they saw some value in the website.
WayBack-Machine - a semi-solution
I know that a project to archive the whole internet, including our subject, already exists: the WayBack-Machine here: http://www.archive.org/web/web.php
But there are some reasons why this is not enough:
- You never know whether this archive will remain available to us in the future. Perhaps the whole service will be shut down one day.
- You don't know who is really behind this service. Perhaps they will someday support censorship of the internet to our disadvantage, and we won't be able to get any data on suppressed websites out of it.
- It is very convenient to have all downloaded pages on your local hard drive and to be able to run a full-text search on only these downloaded Scientology websites. This would give better results than googling the whole internet (including non-Scn sites), because you could search for, e.g., a well-known name and find only references in our context of Scientology. This would not be possible with www.google.com
- In addition, you would also find pages which HAVE BEEN there once but are no longer available.
Where the WayBack-Machine does not work
The server that the particular piece of information lives on is down. Generally these clear up within two weeks. - Quote from http://web.petabox.bibalex.org/collections/web/faqs.html
If unknown URL
If the URL is unknown to the WayBack-Machine, there is no chance for any backup. But if we know about the site and its URL (perhaps it was passed by email from friend to friend), then we are able to make our archive.
For example if you enter http://lrh.myftp.org you find a huge collection of interesting materials, but the wayback machine knows nothing about this site.
Robots.txt Query Exclusion
For example, I wanted to look up this website on the WayBack-Machine: http://www.algonet.se/~tourtel/interests/hubbard_overview_of_my_scn_pages.html
I entered this:
But I got this answer: "We're sorry, access to http://www.algonet.se/~tourtel/interests/hubbard_overview_of_my_scn_pages.html has been blocked by the site owner via robots.txt."
So if the author wants to prevent scanning by robots, the result is that no archive exists!
This is another quote from the WayBack-FAQ:
How can I remove my site's pages from the Wayback Machine?
The BA Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.
and some paragraphs later in the same FAQ:
Some sites are not available because of robots.txt or other exclusions. What does that mean?
The Standard for Robot Exclusion (SRE) is a means by which web site owners can instruct automated systems not to crawl their sites. Web site owners can specify files or directories that are disallowed from a crawl, and they can even create specific rules for different automated crawlers. All of this information is contained in a file called robots.txt. While robots.txt has been adopted as the universal standard for robot exclusion, compliance with robots.txt is strictly voluntary. In fact most web sites do not have a robots.txt file, and many web crawlers are not programmed to obey the instructions anyway. However, Alexa Internet, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If a web site owner decides he / she prefers not to have a web crawler visiting his / her files and sets up robots.txt on the site, the Alexa crawlers will stop visiting those files and will make unavailable all files previously gathered from that site. This means that sometimes, while using the Wayback Machine, you may find a site that is unavailable due to robots.txt (you will see a "robots.txt query exclusion error" message). Sometimes a web site owner will contact us directly and ask us to stop archiving a site, and we endeavor to comply with these requests. When you come across a "blocked site error" message, that means that a site owner has made such a request and it has been honored.
But we are able to download this site anyhow: we will tell our software to ignore robots.txt.
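To make the mechanism concrete, here is a minimal sketch of how a crawler that respects robots.txt decides whether it may fetch a page. The robots.txt content and the name `our-mirror-tool` are assumptions made up for illustration; `ia_archiver` is the user agent name Alexa's crawler was known by. A download tool that is told to ignore robots.txt simply skips this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file on the site may differ.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

URL = "http://www.algonet.se/~tourtel/interests/hubbard_overview_of_my_scn_pages.html"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The archive crawler is excluded from the whole site ...
print(is_allowed(ROBOTS_TXT, "ia_archiver", URL))
# ... while a crawler with another name (hypothetical) is not.
print(is_allowed(ROBOTS_TXT, "our-mirror-tool", URL))
```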
due to copyright violations
Path Index Error
A path index error message refers to a problem in our database wherein the information requested is not available (generally because of a machine or software issue; however each case can be different). We cannot always completely fix these errors in a timely manner. - Quote from the WayBack-FAQ
Not in Archive
Generally this means that the site archived has a redirect on it and the site you are redirected to is not in the archive or cannot be found on the live web. - Quote from the WayBack-FAQ
no recent archives in the Wayback Machine
Pages are not added by the Internet Archive less than 6 months after they are collected, because of the time delayed donation from Alexa. Updates can take more than 12 months in some cases. - Quote from the WayBack-FAQ
Even the big national libraries don't rely on the WayBack-Machine but decided to make a copy of the internet for their own purposes.
The various yahoogroups etc. should also be included.
We should do this for ourselves. Not as a central project, but as a cooperative one: once the main work is done, everyone interested who has a flat rate for internet access and a spare hard drive would be able to get his own copy of this archive.
There are valuable and free tools for this, see for example: http://de.wikipedia.org/wiki/HTTrack
Free of charge and easy.
And I expect that a full backup of all Scientology & FZ resources and sites will need no more space than a normal 500GB drive.
The steps of the project
We collect collections of URLs
First we need a collection of the URLs of all the sites which should be downloaded. A lot of people (churchies, freezoners and critics) have spent many hours putting together lists of links on their favorite subjects and have published these links on the internet.
To give you an idea, I will mention some of these collections here. If you know of further lists, PLEASE insert them directly here or send me an email: Special:Contact
With this handful of links we start a Meta-Collection of Scn-Links: a collection of link collections!
I created a database table for this: tbl_collections
From there we come to the next step:
Entering all Scn-URLs in a table
From each of the collections named above we enter every single URL into another database table: tbl_URLs
We do not sort out any duplicates at this point; we just take every URL we can find and attach any available data on these URLs:
- owner of the site,
- name of the site,
- language of the site,
- category of the site: freezone, critic, church, church-member, mass-media, ...
- the collection these data came from
- and some more
We have already collected more than 1000 URLs, and I think the total may reach perhaps 2000.
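As a rough sketch, the two tables could look like this in SQLite. Only the table names tbl_collections and tbl_URLs and the listed attributes come from the description above; the column names, the `add_url` helper and the example collection URL are my assumptions for illustration.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create both tables; column names are assumptions, not a fixed design."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS tbl_collections (
            id    INTEGER PRIMARY KEY,
            url   TEXT NOT NULL,
            title TEXT
        );
        CREATE TABLE IF NOT EXISTS tbl_URLs (
            id            INTEGER PRIMARY KEY,
            url           TEXT NOT NULL,
            owner         TEXT,
            site_name     TEXT,
            language      TEXT,
            category      TEXT,  -- freezone, critic, church, church-member, mass-media, ...
            collection_id INTEGER REFERENCES tbl_collections(id)
        );
    """)


def add_url(conn, url, collection_id, owner=None, site_name=None,
            language=None, category=None) -> int:
    """Insert one URL as found; duplicates are deliberately not filtered here."""
    cur = conn.execute(
        "INSERT INTO tbl_URLs (url, owner, site_name, language, category, collection_id) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (url, owner, site_name, language, category, collection_id),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
create_schema(conn)
# Hypothetical link collection, used only as an example.
coll_id = conn.execute(
    "INSERT INTO tbl_collections (url, title) VALUES (?, ?)",
    ("http://example.org/links.html", "Example link list"),
).lastrowid
add_url(conn, "http://lrh.myftp.org", coll_id, language="en")
add_url(conn, "http://lrh.myftp.org", coll_id)  # same URL again, kept on purpose
print(conn.execute("SELECT COUNT(*) FROM tbl_URLs").fetchone()[0])
```

Keeping duplicates at this stage matches the plan: sorting them out happens later, during consolidation.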
Consolidation of the data in tbl_URLs
Now we can use the features of a database and sort all the information by either the site owner or the URL. This lets us start consolidating the data by entering further data into each record:
- duplicate record (=don't use it), original is number so-and-so
- offline site (perhaps we can recover it with the WayBack-Machine)
- this "site" is a subdirectory of another one, so use that one and enter the number of the main-site entry here
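Part of this consolidation could probably be automated. The following is only a sketch under my own assumptions: two URLs count as the same record if host and path match after ignoring upper/lower case in the host name and a trailing slash, and a record counts as a subdirectory if another record on the same host has a shorter path. The function names and example URLs are invented for illustration.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> tuple[str, str]:
    """Reduce a URL to (host, path) so that trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.hostname or ""      # hostname is already lower-cased
    path = parts.path.rstrip("/")    # "/dir/" and "/dir" count as the same
    return host, path


def consolidate(records: dict[int, str]) -> dict[int, str]:
    """Return a consolidation note for each record number that needs one."""
    notes: dict[int, str] = {}
    seen: dict[tuple[str, str], int] = {}  # normalized URL -> first record number
    for rec_id in sorted(records):
        key = normalize(records[rec_id])
        if key in seen:
            notes[rec_id] = f"duplicate record, original is number {seen[key]}"
        else:
            seen[key] = rec_id
    for (host, path), rec_id in seen.items():
        # Candidate parents: same host, path is a proper prefix ending at a "/".
        parents = [(len(p), other) for (h, p), other in seen.items()
                   if h == host and other != rec_id and path.startswith(p + "/")]
        if parents:
            notes[rec_id] = f"subdirectory of main-site entry {max(parents)[1]}"
    return notes


# Hypothetical records: record number -> URL as collected.
records = {
    1: "http://www.example.org/",
    2: "http://WWW.example.org",
    3: "http://www.example.org/archive/page.html",
    4: "http://www.example.net/~user/",
}
notes = consolidate(records)
for rec_id in sorted(notes):
    print(rec_id, notes[rec_id])
```

On shared hosts the "subdirectory" guess can be wrong, since unrelated owners may live under one host name, so the notes should be reviewed by hand.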
Checkout the software and hardware used
Perhaps before we start we should compare several competing software tools like WinHTTrack (http://www.httrack.com). Even if some of them cost money, this could be a good investment if it saves working time and hard drive space.
For example, I think it is necessary, as with the WayBack-Machine, to check and download every site at least monthly. But we should not store identical pages again, only new or changed pages. Is this possible with our tool? This would save a lot of hard drive space and make it possible to check and download sites very frequently.
And then we should run a pilot and find out how much disk space we will need. Perhaps one drive is enough for everything?
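Whether a given tool supports this has to be checked, but the idea itself is simple. Here is a minimal sketch, assuming we compare a hash of each downloaded page with the hash of the last stored version; the class `PageStore` and its in-memory storage are invented for illustration and say nothing about how HTTrack works internally.

```python
import hashlib


class PageStore:
    """Keeps a new copy of a page only if its content has changed."""

    def __init__(self):
        self.latest = {}    # url -> SHA-256 digest of the last stored version
        self.versions = []  # list of (url, date, content) actually stored

    def save_if_changed(self, url: str, content: bytes, date: str) -> bool:
        """Store the page and return True, or return False if it is unchanged."""
        digest = hashlib.sha256(content).hexdigest()
        if self.latest.get(url) == digest:
            return False
        self.latest[url] = digest
        self.versions.append((url, date, content))
        return True


store = PageStore()
url = "http://example.org/index.html"  # hypothetical page
print(store.save_if_changed(url, b"<html>version 1</html>", "2008-01"))
print(store.save_if_changed(url, b"<html>version 1</html>", "2008-02"))
print(store.save_if_changed(url, b"<html>version 2</html>", "2008-03"))
print(len(store.versions))
```

With three monthly checks and one real change, only two versions are kept.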
Document our findings on the steps until now for the public
We should document our findings on the steps so far for the public. This way we will get further feedback (missing URLs, better tools, new ideas) and improve our project.
We also make it possible for others, who don't have the time to do all these steps, to create their own archive. The more archives exist, the safer our access to the data is, because hard drives can fail, and with colleagues we can access their copies without too much overhead!
Create a Mailing List for the Scn-Internet-Archive
That's why we will create a Mailing List for the Scn-Internet-Archive: to be able to cooperate with each other on this easily. None of us needs a backup of his hard drive, as we are in comm with each other, and if one of the hard drives fails, we exchange data among us. But there is of course also the possibility for people to just use our data and stay in hiding. No problem.
After we've downloaded the sites for some time, we'll observe which sites are down or up, and can note in our tbl_URLs which ones are stable (not much change) and which are very active in updating. We mark this information in our tbl_URLs so that we can make more frequent backups of the more active sites and less frequent backups of the stable sites. With a group of people doing this, we could also, without much risk, share the work among us.
I would, for example, download a more stable site just once a year. If I find that, for whatever reason, the site went offline within the year (and perhaps there were some updates I missed), I would get a copy from one of my friends from around the date it went offline.
This way, 12 friends could coordinate their downloads of such sites by having one friend download them all in a particular month. That way, each friend downloads them all only once a year. In case we need a current, up-to-date copy, we'll know who has it. This saves download time, working time and hard drive space.
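The rotation can be written down in a few lines. This is only a sketch: the function name and the placeholder friend names are invented, and with fewer than 12 friends the list simply wraps around, so that some friends take more than one month.

```python
def month_schedule(friends: list[str]) -> dict[int, str]:
    """Map each month (1-12) to the friend who downloads all sites that month."""
    if not friends:
        raise ValueError("need at least one friend")
    return {month: friends[(month - 1) % len(friends)] for month in range(1, 13)}


# Placeholder names; with exactly 12 friends everyone downloads once a year.
friends = [f"friend_{i}" for i in range(1, 13)]
schedule = month_schedule(friends)

# Whoever needs a copy from around March asks the friend assigned to month 3.
print(schedule[3])
```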
A short test of a download and resources
I made a short test and downloaded 320 sites, big and small ones, unsorted.
It took 36 hours to download 5 GBytes of data: more than 90,000 pages.
I assume that the final project will back up somewhere between 3,000 and 30,000 sites, use 50 to 500 GBytes, and need 15 to 150 days of download time for each set.
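The estimate follows from scaling the test figures linearly, which assumes that the 320 test sites are typical in size and that the download speed stays the same. A short calculation to check it:

```python
# Figures from the short test above.
TEST_SITES = 320
TEST_GBYTES = 5
TEST_HOURS = 36


def estimate(sites: int) -> tuple[float, float]:
    """Linear extrapolation: returns (GBytes, days of download time)."""
    gbytes = sites * TEST_GBYTES / TEST_SITES
    days = sites * TEST_HOURS / TEST_SITES / 24
    return gbytes, days


for sites in (3000, 30000):
    gbytes, days = estimate(sites)
    print(f"{sites} sites: about {gbytes:.0f} GBytes, {days:.0f} days")
```

This gives about 47 GBytes and 14 days for the lower bound and about 469 GBytes and 141 days for the upper bound, which is where the rounded figures of 50 to 500 GBytes and 15 to 150 days come from.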
Handling of already offline Sites
Important sites which are already offline will be recovered by means of the Internet Archive (WayBack-Machine http://web.archive.org ) and can then be full-text searched just like the other sites. This is also a big advantage over the Internet Archive, where you are not able to run any full-text searches.
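To illustrate what such a local full-text search could look like, here is a very small sketch of a word index over downloaded pages. It assumes the page text has already been extracted from the HTML; the function names and the example pages are invented, and a real archive of this size would need a proper search tool.

```python
import re
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of URLs whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the URLs that contain all words of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical downloaded pages: URL -> extracted text.
pages = {
    "http://example.org/a.html": "An overview of the early lectures",
    "http://example.org/b.html": "Lectures and tapes, a complete list",
    "http://example.net/c.html": "Contact and imprint",
}
index = build_index(pages)
print(sorted(search(index, "lectures")))
print(sorted(search(index, "complete lectures")))
```

Because only the archived Scn-related pages are indexed, every hit is already within our context, including pages that are no longer online.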
Further project: An online-Scientology-Webarchive
Similar to the WayBack-Machine, we could consider putting our collection online and giving access to everyone. But this will need some more resources and could be done later. My intention is to collect everything in time, as long as it is available.
Critical remarks about this project
Hi, Andreas. I'm not much interested in keeping the negativity alive. So sorry, I won't be any help with this project.
I don't look at the negative at all. But a lot of source materials, especially the OT-levels, were first published on critic sites.
Although I am not interested in reading these critic sites, I will download them all and make a backup, because who knows: perhaps there is something of interest in one of them that would otherwise be lost.
I am a librarian and thus we just offer the service of a library. It is up to the users how they use it and for what purpose.
A library is pure information and a tool for the researcher and the student.
And there is also some truth on critic sites and some truth and valuable data on church sites. And when they realize it, they suddenly take down their site and the truth could be lost.
That's why I want to download it all. And it is not much effort. Just one hard drive for all of it. So there is really no need to sort it out.
Who is interested to work with me on this?
I am looking for friends, who are willing to cooperate on this project. Do you know of someone working on this or interested in this?