Project on many Offline Internet-Archives on the subject of Scientology, Freezone, Critics

Plan

Mirroring of all websites and newsgroups on the subject of Scientology (including critics, Church sites and independents) on many independent computer systems of the library network, so that these data remain available even if the original publisher withdraws them from circulation. This is a necessity for scholarly work and to safeguard the freedom of speech.

Situation

There is no guarantee that anything once published on the internet will stay there forever! See a current example of this here: Truth revealed: Website suppressed, which showed the truth about the take-over of Scientology by the US-government

I know that there is already a project to archive the whole internet, including our subject: the WayBack-Machine at http://www.archive.org/web/web.php

But there are several reasons why this is not enough:

  1. You never know whether this archive will remain available to us in the future. Perhaps the whole service will be shut down one day.
  2. You don't know who is really behind this service. The site is in Alexandria, Egypt, and there is a CIA headquarters in Egypt. Perhaps they will one day support internet censorship to our disadvantage, and we would get no data out on suppressed websites.
  3. It is very useful to have all downloaded pages on your local hard drive and to be able to run a full-text search over just these downloaded Scientology websites (see the sketch after this list). This would give better results than googling the whole internet (including non-Scn sites), because you could search for, say, a well-known name and find only references in our Scientology context. This would not be possible with www.google.com.
  4. In addition, you would also find pages which WERE once there but are no longer available.
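
To illustrate point 3: a minimal sketch of such a local full-text search, assuming the mirrored sites live under one directory tree (the path ~/scn-archive and the file extensions are placeholders for illustration):

 # Minimal local full-text search over the downloaded mirror (point 3).
 # The mirror root and the file extensions are assumptions.
 import os

 MIRROR_ROOT = os.path.expanduser("~/scn-archive")  # hypothetical location

 def search(term: str) -> None:
     """Print every mirrored page that contains `term`."""
     term = term.lower()
     for dirpath, _dirs, files in os.walk(MIRROR_ROOT):
         for name in files:
             if not name.endswith((".html", ".htm", ".txt")):
                 continue
             path = os.path.join(dirpath, name)
             try:
                 with open(path, encoding="utf-8", errors="ignore") as f:
                     if term in f.read().lower():
                         print(path)
             except OSError:
                 pass  # unreadable file; skip it

 if __name__ == "__main__":
     search("hubbard")  # e.g. find a well-known name only in our context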

Solution

Even the big national libraries don't rely on the WayBack-Machine, but have decided to make a copy of the internet for their own purposes.

The various Yahoo groups etc. should also be included.

We should do this for ourselves. Not as a central project, but as a cooperative one: once the main work is done, anyone interested who has flat-rate internet access and a spare hard drive will be able to get their own copy of this archive.

There are valuable and free tools for this, see for example: http://de.wikipedia.org/wiki/HTTrack

Free of charge and easy to use.
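
For a single site, a download could be started roughly like this; a minimal sketch driving the httrack command line from Python, where the URL and target directory are placeholders (the exact options should be checked against the HTTrack documentation):

 # Sketch: mirror one site with HTTrack; URL and paths are placeholders.
 import subprocess

 def mirror(url: str, target_dir: str) -> None:
     """Mirror `url` into `target_dir` using the httrack command line."""
     subprocess.run(["httrack", url, "-O", target_dir], check=True)

 mirror("http://www.example.com/", "/archive/example.com")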

And I expect that a full backup of all Scientology & FZ resources and sites will need no more space than a normal 500 GB drive.
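
A rough sanity check of that estimate (the per-site figure is only an assumption): if the collection grows to about 2000 sites, as guessed below, at an average of perhaps 100 MB per mirrored site, that comes to 2000 × 100 MB = 200 GB, which fits on a 500 GB drive with room to spare for snapshots.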

The steps of the project

We collect collections of URLs

First we need a collection of URLs of all the sites which should be downloaded. A lot of people (churchies, freezoners and critics) have spent many hours putting together lists of links to their favorite subjects and have published these links on the internet.

To give you an idea, I will mention some of these collections here; if you know of further lists, PLEASE insert them directly here or send me an email: Special:Contact

With this handful of links we start a meta-collection of Scn links: a collection of link collections!

I created a database table for this: tbl_collections
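
A minimal sketch of how this table could look, here as SQLite driven from Python (the column names are my assumptions, not a fixed schema):

 # Sketch of tbl_collections in SQLite; columns are assumptions.
 import sqlite3

 con = sqlite3.connect("scn_archive.db")  # hypothetical database file
 con.execute("""
     CREATE TABLE IF NOT EXISTS tbl_collections (
         id      INTEGER PRIMARY KEY,
         url     TEXT NOT NULL,   -- where the link collection lives
         owner   TEXT,            -- who maintains it
         comment TEXT
     )
 """)
 con.commit()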

From there we come to the next step:

Entering all Scn-URLs in a table

From each of the above-named collections we enter every single URL into another database table: tbl_URLs

We do not sort out duplicates at this point; we just take every URL we can find, and attach any available data on these URLs:

  • owner of the site,
  • name of the site,
  • language of the site,
  • category of the site: freezone, critic, church, church-member, mass-media, ...
  • from which collection these data stem,
  • and some more (see the schema sketch after this list).
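
A sketch of tbl_URLs along the same lines (again SQLite from Python; the exact columns and types are my assumptions, derived from the list above and the consolidation step below):

 # Sketch of tbl_URLs; column names and types are assumptions.
 import sqlite3

 con = sqlite3.connect("scn_archive.db")
 con.execute("""
     CREATE TABLE IF NOT EXISTS tbl_URLs (
         id            INTEGER PRIMARY KEY,
         url           TEXT NOT NULL,
         owner         TEXT,   -- owner of the site
         name          TEXT,   -- name of the site
         language      TEXT,   -- language of the site
         category      TEXT,   -- freezone, critic, church, ...
         collection_id INTEGER REFERENCES tbl_collections(id),
         duplicate_of  INTEGER REFERENCES tbl_URLs(id),  -- set during consolidation
         offline       INTEGER DEFAULT 0,                -- 1 = site has gone offline
         parent_id     INTEGER REFERENCES tbl_URLs(id)   -- subdirectory of a main site
     )
 """)
 con.commit()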

We have already collected more than 600 URLs, and I think it will grow to perhaps 2000...

Consolidation of the data in tbl_URLs

Now we can use the power of a database and sort all the information by either site owner or URL. This lets us start consolidating the data by adding further information to each record (see the sketch after this list):

  • double record (= don't use it; the original is record number so-and-so)
  • offline site (perhaps we can recover it via the WayBack-Machine)
  • this "site" is a subdirectory of another one, so use that one and enter here the number of the main-site entry

Check out the software and hardware to be used

Perhaps before we start we should compare several competing software tools like WinHTTrack (http://www.httrack.com); even if some of them cost money, this could be well invested if it saves working time and hard-drive space.

For example, I think it is necessary, as with the WayBack-Machine, to check and download every site at least monthly. But we should not store identical pages, only new or changed pages. Is this possible with our tool? (A sketch of one approach follows.) It would save a lot of hard-drive space and make it possible to check and download sites very frequently.
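
As far as I know, HTTrack's update mode re-downloads only new or changed files when run again over an existing mirror, which answers the bandwidth half of the question. For keeping monthly snapshots without storing identical pages twice, one option is to store each page body once under its content hash; a minimal sketch of the idea (the layout and paths are assumptions, not a finished design):

 # Sketch: keep monthly snapshots without duplicating unchanged pages.
 # Each distinct page body is stored once, named by its SHA-256 hash;
 # a snapshot is just a manifest of (relative path, hash) pairs.
 import hashlib
 import os
 import shutil

 STORE = "/archive/objects"  # placeholder for the shared content store

 def store_page(path: str) -> str:
     """Copy `path` into the store if its content is new; return the hash."""
     with open(path, "rb") as f:
         digest = hashlib.sha256(f.read()).hexdigest()
     os.makedirs(STORE, exist_ok=True)
     dest = os.path.join(STORE, digest)
     if not os.path.exists(dest):  # unchanged pages are stored only once
         shutil.copyfile(path, dest)
     return digest

 def snapshot(mirror_dir: str, manifest_path: str) -> None:
     """Record one monthly snapshot of a mirror as a manifest file."""
     with open(manifest_path, "w", encoding="utf-8") as manifest:
         for dirpath, _dirs, files in os.walk(mirror_dir):
             for name in files:
                 full = os.path.join(dirpath, name)
                 rel = os.path.relpath(full, mirror_dir)
                 manifest.write(f"{rel}\t{store_page(full)}\n")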

And then we should run a pilot and find out how much disk space we will need. Perhaps one drive is enough for everything?

Document our findings so far for the public

We should document our findings so far for the public. This will bring us further feedback (missing URLs, better tools, new ideas) and improve our project.

We also make it possible for others, who don't have the time to go through all these steps, to create their own archive. The more archives exist, the safer our access to the data is, because hard drives can fail, and with colleagues we can access their copies without too much overhead!

Create a Mailing List for the Scn-Internet-Archive

That's why we will create a mailing list for the Scn-Internet-Archive: to be able to cooperate with each other easily. None of us needs a backup of his hard drive, as we are in comm with each other: if one of the hard drives fails, we exchange copies among us. But of course there is also the possibility for people to just use our data and stay in hiding. No problem.

Regular download sessions of all the Scn-related sites

After we have downloaded the sites a few times, we will see which sites are down (and can mark this in our tbl_URLs), which are stable (not many changes) and which are very active with updates. We record this information in tbl_URLs so that we can back up the more active sites more frequently and the stable sites more rarely. With a group of people doing this, we could also share the work among us without much risk: I would, for example, download a fairly stable site just once a year if I knew that in an emergency (the site goes offline within that year and perhaps I missed some updates) I could get a copy from the friend whose download is closest to the date it went offline. In this way 12 friends could coordinate their downloads of such sites so that each month a different one of them downloads them all (see the sketch below). Each friend downloads everything once a year, but if we need a current copy, we know who may have it. This saves download time, working time and hard-drive space.
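
A minimal sketch of that rotation, assuming exactly 12 participants (the names are hypothetical):

 # Sketch: staggered downloads by 12 friends, each mirroring once a year.
 FRIENDS = [f"friend_{i}" for i in range(1, 13)]  # hypothetical names

 def downloader_for(month: int) -> str:
     """Who downloads everything in the given month (1..12)."""
     return FRIENDS[(month - 1) % 12]

 def freshest_copy(offline_month: int) -> str:
     """If a site went offline in `offline_month`, the friend who
     downloaded in the previous month holds the newest safe copy."""
     return downloader_for((offline_month - 2) % 12 + 1)

 print(downloader_for(3))  # who mirrors everything in March
 print(freshest_copy(3))   # newest copy if a site vanished in March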

Further project: an online Scientology web archive

Similar to the WayBack-Machine, we could consider putting our collection online and giving everyone access. But this would need some more resources and could be done later. My intention is to collect everything now, while it is still available.

Who is interested in working with me on this?

I am looking for friends who are willing to cooperate on this project. Do you know of someone working on this or interested in it?

yours

Andreas