drx: Request for removal

Request for removal is a Perl script that harvests the Message-IDs of all articles by a given author from the Google Groups service.

Details

From the 1980s to the mid-1990s, Usenet was a very popular, message-board-like service on the internet. It existed long before the World Wide Web and is still in use today.

Back then, only a few people had access to the internet and could use Usenet. The community consisted mostly of technically minded men who discussed technical but also private topics. Few of them imagined that all of their postings would later be archived by the web service Deja News and made available to a new, gigantic audience during the internet boom of the 1990s.

When Deja News went out of business in 2001, Google purchased its archive and other archives of Usenet articles and re-released them under its own brand as Google Groups.

The problem: a large number of discussion groups were used for chatting, exchanging local gossip, publishing poetry and the like. Many people signed their messages with their real names or talked freely about people who did not know about the internet -- and probably would never read anything there, right?

Today everybody's grandmother is online, and what was a message intended for a small peer group has become part of another Google product. Statements made a decade ago under an illusion of safety are now searchable with all of Google's convenience and can lead to awkward situations.

Google offers to remove messages from its Google Groups database, but the process of collecting one's Message-IDs is cumbersome, especially for people who have been very active on Usenet.

Request for removal collects all the Message-IDs of an author, identified by the email address in each posting's From header, and saves them to a text file. They can then be copied and pasted into Google's removal tool.

Requirements

The script requires the Perl programming language with the CPAN modules LWP and URI.
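
Before running the script, it can help to confirm that both modules load. The following is a minimal sketch, assuming the two module names the script itself loads (LWP::UserAgent and URI::Escape); the helper name module_available is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # LWP::UserAgent -> LWP/UserAgent.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Report the status of the two modules the script depends on.
for my $module ('LWP::UserAgent', 'URI::Escape') {
    printf "%-16s %s\n", $module,
        module_available($module) ? 'installed' : 'missing (install it from CPAN)';
}
```

A module reported as missing can usually be installed with the CPAN client that ships with Perl, or with the package manager of the Perl distribution in use.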

Issues

Code

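Update 2008-11-07: Julian Wiersbitzki made a small update with the following improvements: the $message_body pattern was adjusted to Google's changed HTML for the message body, and the script now also writes a file with the Google URLs of the messages, which can be used for removal requests as well.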

#!/usr/bin/perl
#
# REQUEST FOR REMOVAL
#
# This script asks Google Groups for messages created by $author and
# prints a list of all found Message-IDs.
#
# v1.0 released by Dragan Espenschied
# This Software is in the Public Domain.
#
# v1.1 released by Julian Wiersbitzki
# Changes:
# - Google changed HTML-Code of Message-Body, $message_body customized.
# - Also a file with Google-URLs for messages is created. These URLs can
#   also be used as request for removal.

use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape;

my $author = 'i.am@example.com';   # author's email address goes here

my %groups_messages;   # this hash will contain all found message IDs

# Fake browser
my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0 (Linux; U; appSysName i686; de; rv:1.7.5) Gecko/20041108 Firefox/1.0');

my $result_page_counter = 0;   # we are on this SERP
my $more = 1;                  # are new links found or did we reach the end of the list?

while ($more == 1) {   # while new links are found
    print "SERP page: $result_page_counter\n";

    # construct URI with SERP number
    my $request_uri =
        'http://groups.google.com/groups?q=author%3A' . uri_escape($author) .
        '&start=' . $result_page_counter .
        '&hl=de&lr=&num=100&filter=0';

    my $response = $ua->get($request_uri);
    unless ($response->code == 200) {   # test on HTTP error codes
        die "Error! $request_uri\nHTTP-Status: " . $response->code . "\n";
    }
    my $google_result_page = $response->content;   # get SERP

    # Check on Google spyware detection
    if ($google_result_page =~ /<title>403 Forbidden<\/title>/m) {
        die "Error! Google thinks this is malicious software.\nPlease try again later.\n";
    }

    # look on the SERP for links that contain a ref to /group,
    # catch group name and Google's hash identifier
    my $counter = 0;
    while ($google_result_page =~ /<a\s+href="\/group\/(\S+)\/browse_thread\/[^"]+#([0-9a-z]+)"/sg) {
        unless (exists($groups_messages{$2})) {
            # this is a yet unknown message:
            # save group name and Google hash identifier
            $groups_messages{$2}{group} = $1;
            print "$2 -- $1\n";
            $counter++;
        }
        else {
            # this message already appeared before,
            # which means that we should search no more
            $more = 0;
        }
    }

    print "Found: $counter posts.\n";
    if ($counter < 100) {   # if there are less than 100 new posts on the
        $more = 0;          # SERP, this is the last page for this query
    }

    $result_page_counter += 100;   # increase SERP number
    sleep(int(rand(5)));           # wait some time not to stress Google too much
}

# open the two output files
open(my $save_ids, '>', "message_ids_for_$author.txt")
    or die "could not save file: $!\n";
open(my $save_urls, '>', "message_urls_for_$author.txt")
    or die "could not save file: $!\n";

# retrieve "source" of all found messages
foreach my $google_hash (keys %groups_messages) {
    # URI contains group name and Google's hash identifier
    my $request_uri =
        'http://groups.google.com/group/' . $groups_messages{$google_hash}{group} .
        '/msg/' . $google_hash . '?dmode=source&hl=de';

    my $response = $ua->get($request_uri);
    unless ($response->code == 200) {
        die "Error! $request_uri\nHTTP-Status: " . $response->code . "\n";
    }
    my $message_body = $response->content;

    # Check on Google spyware detection
    if ($message_body =~ /<title>403 Forbidden<\/title>/m) {
        die "Error! Google thinks this is malicious software.\nPlease try again later.\n";
    }

    my $message_url = "http://groups.google.de/group/$groups_messages{$google_hash}{group}/msg/$google_hash";
    print "$message_url\n";

    # find Message-ID from the header; the page may show the brackets
    # either literally or as the HTML entities &lt; and &gt;
    if ($message_body =~ /<pre>.+Message-ID: (?:<|&lt;)(\S+?)(?:>|&gt;).+<\/pre>/s) {
        # Message-ID extracted: print it and write ID and URL to their files
        my $message_id = $1;
        $groups_messages{$google_hash}{msgid} = $message_id;
        print "$message_id\n";
        print $save_ids "$message_id\n";
        print $save_urls "$message_url\n";
    }
    else {
        # no Message-ID found
        print "Message not found, probably already deleted...\n";
    }

    sleep(int(rand(5)));   # wait some time not to stress Google too much
}

close $save_ids;
close $save_urls;
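
The Message-ID lookup is the step most sensitive to changes in Google's markup. The sketch below runs the extraction on a made-up sample page instead of a live request, and accepts both literal angle brackets and the HTML entities &lt; and &gt;. The sample text and the helper name extract_message_id are assumptions for illustration, not Google's actual output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the Message-ID out of a message source page.
# Returns the ID without brackets, or undef when no header is found.
sub extract_message_id {
    my ($html) = @_;
    # Accept <id> as well as the escaped form &lt;id&gt;.
    if ($html =~ /Message-ID:\s*(?:<|&lt;)(\S+?)(?:>|&gt;)/i) {
        return $1;
    }
    return;
}

# Made-up sample, shaped like an HTML-escaped article header.
my $sample = "<pre>From: i.am\@example.com\n"
           . "Message-ID: &lt;abc123\@news.example.com&gt;\n"
           . "Subject: test</pre>";

my $id = extract_message_id($sample);
print defined $id ? "$id\n" : "Message not found\n";
```

Testing the match result directly, rather than inspecting $1 afterwards, avoids picking up a stale capture from an earlier successful match.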

Claude (2006-07-16, 20h19):

This is a great tool! However, I am using ActivePerl on Windows and I only get empty files.
- Snapper (2006-08-07, 21h36):
  
  I've had the same problem. I've spent a long time searching the code for mistakes, but I can't find any. One solution, however, is to ditch the whole last "foreach" command. Then, replace the
  print "$2 -- $1\n";
  with
  my $export_file = open(SAVE, ">>message_ids.txt") or die "could not save file: $!\n";
  print "http://groups.google.com/group/$1/msg/$2?dmode=source&hl=en \n";
  print SAVE "http://groups.google.com/group/$1/msg/$2?dmode=source&hl=en \n";
  This will give you a file with all the urls you need, which can be used in the removal tool.
  - Claude (2006-09-26, 00h38):
    
    I hate Google....
    Your modification works, but when I enter the URLs into the Google removal tool I only get "Server Error". Isn't this convenient? Google seems to be on a quest to gather all the information in the world -- no matter whether you are willing to have it indexed or not...
Claude (2006-07-16, 20h28):

Maybe I found the problem. Your script uses "hl=de", but for the US it should probably be "hl=en". I have no way to test this now since Google disabled my access.
Oliver (2006-11-23, 01h53):

When searching by email I get no results; when using my name I get plenty, but:
There is always an error and nothing is written to the file:
Use of uninitialized value in concatenation (.) or string at test.pl line 116.
Use of uninitialized value in concatenation (.) or string at test.pl line 117.
Any ideas on this matter?
(Using: Windows @ ActivePerl)
Jay (2007-08-22, 16h56):

Thanks so much for the tool. I was just about to make something like this, but you saved me a lot of time! With Snapper's suggested change, it works just fine in Ubuntu 7.04.
Jay (2007-08-22, 17h37):

As an update, I figured out why it wasn't spitting out the actual message IDs - the script was looking for actual < > brackets, when the source code was using &lt; and &gt; instead.
So instead of:
$message_body =~ /<pre>.+Message-ID: <(\S+)>.+<\/pre>/s;
It should be:
$message_body =~ /<pre>.+Message-ID: &lt\;(\S+)&gt\;.+<\/pre>/s;
This works perfectly in Ubuntu. Thanks again!
- Jay (2007-08-22, 20h56):
  
  Apparently the codes didn't show up! So again the brackets are supposed to be:
  & lt \;
  and
  & gt \;
  (remove the spaces after the & and before the \)
  - Dragan (2007-08-22, 21h11):
    
    Hi Jay, sorry for the trashup of your perl code. The original comment is now okay. I had a wrong regexp in the routine to display comments. Just like the one you corrected ... Thanks!
    - Sean (2009-04-03, 01h26):
      
      It's not working anymore. Does anyone know how to update it? I'm not good at this.