While of course I endorse as diverse a reading of the preprint archives as is possible, we all know that there’s simply too much going on there for every paper to get even the most cursory of glances, let alone a proper reading. My strategy has been to reduce the daily postings to a manageable number using a Perl script that weights papers by keywords in their titles and abstracts. This technique was first implemented by Richard Edgar and was brought to my attention indirectly by R. J. Massey, when one of my papers appeared in his list, which was in turn linked from Google.
I have modified Edgar’s original script to bring it to what passes in my mind for being up-to-date. This post describes the filtering script and gives instructions for running it yourself, should that be desirable, or for modifying it to your own taste, should it not.
The script does the following:
- Acquires the arXiv feed of astro-ph;
- Parses the feed with the Perl module XML::FeedPP to obtain the titles and abstracts as raw text blocks;
- Scores each block against a weighted list of keywords;
- Ranks the list by score and, optionally, truncates it (on my page, to n = 10 articles);
- Re-assembles the list into a feed, which is written to this page.
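The scoring, ranking and truncation steps can be sketched in a few lines of Perl. This is a minimal illustration rather than the script itself: the weights, the three titles and the helper name score_text are all made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical weights and titles, for illustration only.
my %weights  = ( 'dark matter' => 2, 'halo' => 4, 'solar' => -1 );
my @articles = (
    'Dark matter halo profiles in simulations',
    'Solar flare statistics',
    'Halo occupation of luminous galaxies',
);

# Sum weight x (number of case-insensitive occurrences) over all keywords.
sub score_text {
    my ($text) = @_;
    my $score = 0;
    for my $kw ( keys %weights ) {
        my $count = () = $text =~ /\Q$kw\E/gi;
        $score += $count * $weights{$kw};
    }
    return $score;
}

# Pair each title with its score, sort best-first, keep at most $n.
my $n      = 2;
my @ranked = sort { $b->[0] <=> $a->[0] }
             map  { [ score_text($_), $_ ] } @articles;
$#ranked = $n - 1 if @ranked > $n;

printf "%5.1f  %s\n", @$_ for @ranked;
```

With these made-up weights the first and third titles survive the cut and the solar flare paper is dropped.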
The script is run each morning at 7am GMT and the output piped into Google Reader. One can of course pipe the raw arXiv feed into Google Reader as well, and originally this is what I did, but I quickly became frustrated with the large number of articles that were uninteresting to me. I believe that I don’t miss any important papers as a result of this truncation. Here is the script in its entirety (note that some auxiliary Perl modules are required, which may or may not be standard with your distribution; note also that I have removed the automated copying of the output file to my html space because you won’t want that):
#!/usr/bin/perl
# This filter is designed to read
# the feeds from the arXiv
# and score the articles therein
# It then writes out a feed of its own!
#
# By Berian James, based on a script by Richard Edgar
use warnings;
use strict;
use lib "/home/jbj/pl/pm";
use Getopt::Long;
use XML::FeedPP;
my( $rcs, $rev, $rcsdate );
my( $configFile, $htmlFile );
my( @feeds, @articleList, %scores, @artItems);
my( $feedLocation );
my( $totalArticles, $youngArticles, $goodArticles );
my( $debug );
my $nArticles = 10; # Just put up the ten best articles
# -------------------------------------
# Check command line options
$debug = "";
$configFile = "";
$htmlFile = "";
GetOptions( 'debug'    => \$debug,
            'config=s' => \$configFile,
            'output=s' => \$htmlFile );
if( $debug ){
    print "Debug mode enabled\n";
}
# -------------------------------------
# Attempt to read in the configuration file
&LoadDataFile( $configFile );
# -------------------------------------
# Process each feed, to extract the articles
foreach $feedLocation( @feeds ){
    &ProcessFeed( $feedLocation );
}
$totalArticles = @articleList;
print "\nFound $totalArticles articles\n";
# ------------------------------------
# Score the articles, and weed out the unworthy
&ProcessArticles;
$goodArticles = @articleList;
print "\nFound $goodArticles scores high enough\n";
# ------------------------------------
# Sort the list
@articleList = sort CompareArticles @articleList;
# Keep at most the $nArticles best; the sorted list starts at index 0
splice( @articleList, $nArticles ) if @articleList > $nArticles;
$goodArticles = @articleList;
print "\nUsing best $goodArticles articles.\n";
# ------------------------------------
# Write the output file
my $feed = XML::FeedPP::RDF->new();
$feed->title( "astro-ph updates on arXiv.org" );
$feed->link( "http://www.arxiv.org/" );
$feed->description( "Astrophysics (astro-ph) updates on the arXiv.org e-print archive" );
foreach my $article (@articleList) {
    @artItems = split(/!!/,$article);
    my $item = $feed->add_item( "$artItems[3]" );
    $item->title( "$artItems[1]" );
    $item->description( "$artItems[2]" );
}
$feed->to_file( "$htmlFile" );
print "\nastroph-filter complete\n";
exit;
# =============================================
sub LoadDataFile{
    # Reads in the data file supplied in the argument
    # Fills in the globals @feeds
    #                      %scores
    my( $inFile ) = @_;
    my( @configLines, $line );
    if( $debug ){
        print "\nEntering LoadDataFile\n";
    }
    # Open the file, and slurp it in
    open( my $myConfig, "<", $inFile ) || die "Can't open configuration file '$inFile': $!\n";
    @configLines = <$myConfig>;
    close( $myConfig );
    # Loop over each line, determining what it is
    foreach $line ( @configLines ){
        if( $line =~ /^\#/ ) {
            # Comment lines start with "#"
        }
        elsif( $line =~ /^Feed:\s*(.*)/ ){
            # We have a new feed
            push( @feeds, $1 );
            if( $debug ){
                print "Feed location: $1\n";
            }
        }
        elsif( $line =~ /^\S/ ){
            # We have a keyword score: the last field is the score,
            # everything before it is the (possibly multi-word)
            # keyword, which is stored in lower case
            my @tmp = split(/\s+/,$line);
            my $tmpscore = pop(@tmp);
            my $keyword = lc(join(' ',@tmp));
            $scores{$keyword} = $tmpscore;
            if( $debug ){
                print "New pattern \"$keyword\" with score $tmpscore\n";
            }
        }
    }
    if( $debug ){
        print "Configuration file loaded\n";
    }
}
# =============================================
sub ProcessFeed{
    # Processes the given feed, extracting the astro-ph
    # articles
    # These are added onto the global @articleList
    my( $source ) = @_;
    my( $feed, $currArticle );
    if( $debug ){
        print "\nProcessing feed $source\n";
    }
    # Open up the feed
    # XML::FeedPP->new may die rather than return false on failure,
    # so trap that and return if we get nothing
    $feed = eval { XML::FeedPP->new( $source ) };
    unless( $feed ) {
        print STDERR "Can't open $source\n";
        return;
    }
    foreach my $item ( $feed->get_item() ) {
        # print "URL: ", $item->link(), "\n";
        # print "Title: ", $item->title(), "\n";
        # print "Abstract: ", $item->description(), "\n";
        $currArticle = "Title: " . $item->title()."\n";
        $currArticle = $currArticle . "Abstract: " . $item->description() ."\n";
        $currArticle = $currArticle . "URL: " . $item->link() ."\n";
        push(@articleList,$currArticle);
    }
    if( $debug ){
        print "Feed processed\n";
    }
}
# =============================================
sub ProcessArticles{
    # Routine which goes through all the articles,
    # scores them, and accumulates them onto
    # the output list
    my( @outputArticles, $article );
    my( $title, $authorText, $abstract, $articleURL );
    my( $matchText, $articleScore );
    my( $wordCount, $word );
    my( $artLine );
    if( $debug ){
        print "\nProcessing main article list\n";
    }
    foreach $article( @articleList ){
        # Turn the new lines into spaces - they only make things
        # difficult (HTML tags in the Abstract are left as they are)
        $article =~ tr/\n/ /s;
        # Check we can extract title etc.
        if( $article =~ /Title:\s(.*)Abstract:\s(.*)URL:\s(.*)/m){
            $title = $1;
            $abstract = $2;
            $articleURL = $3;
            # Score the article
            $articleScore = 0;
            $matchText = $abstract.$title;
            foreach $word ( keys %scores ){
                # Count the case-insensitive matches of the keyword;
                # s///g returns the number of substitutions made
                $wordCount = ( $matchText =~ s/($word)/$1/gi );
                $articleScore += $wordCount * $scores{$word};
            }
            if( $debug ){
                print "Title: ", $title,"\n";
                print "Score: ", $articleScore,"\n";
            }
            # This join is split in the sort, and on
            # final output
            $artLine = join( "!!",
                             $articleScore,
                             $title, $abstract, $articleURL );
            push( @outputArticles, $artLine );
        }
    }
    # Assign back
    @articleList = @outputArticles;
    if( $debug ){
        print "Articles processed\n";
    }
}
# =============================================
sub CompareArticles{
    # Compares two articles, as packed above
    # Used by the sort routine
    # Higher scores sort first
    my( @art1, @art2 );
    @art1 = split(/!!/,$a);
    @art2 = split(/!!/,$b);
    $art2[0] <=> $art1[0];
}
So that’s the script. The command-line invocation is
~/scripts/arXivFilter --config=/home/jbj/scripts/config.arXivFilter --output=/home/jbj/scripts/astroph.htm
and the list of keywords (config.arXivFilter) that I use is
# Astro-ph filter configuration file
Feed: http://export.arxiv.org/rss/astro-ph
galaxy 0.5
galaxies 0.5
universe 3
supernovae type Ia 3
correlation function 2
angular correlation function 1
projected correlation function 1
genus 4
Minkowski functional 4
Gaussian random 1
Gaussian field 1
Gaussian fields 1
topology 2
cosmology 3
large-scale structure 3
density field 2
halo 4
dark matter 2
structure formation 5
w_p 5
cosmological constant 2
statistics 1
likelihood 2
cmb 1
lensing 1
weak lensing 1
gravitational lensing 1
Friedmann equation 4
inflation 2
dark energy 2
solar -1
star -0.2
stellar -0.3
magnetic -1
braneworld -0.5
jet -0.2
redshift 0.5
To anyone working on magnetic solar braneworlds: I promise to talk to my colleagues regularly about this topic so that if you publish an important paper in the area I will quickly learn of it.
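As a rough illustration of how a config file in this format can be read, the sketch below parses a handful of lines taken from my list. It is a standalone example, not a replacement for LoadDataFile: the regular expression for keyword lines assumes that every entry is one or more keyword words followed by a single numeric score, which is stricter than the split-and-pop approach in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few lines in the format of the config file.
my @lines = (
    '# Astro-ph filter configuration file',
    'Feed: http://export.arxiv.org/rss/astro-ph',
    'supernovae type Ia 3',
    'large-scale structure 3',
    'solar -1',
);

my ( @feeds, %scores );
for my $line (@lines) {
    next if $line =~ /^\s*(?:#|$)/;        # skip comments and blank lines
    if ( $line =~ /^Feed:\s*(\S+)/ ) {     # feed location
        push @feeds, $1;
        next;
    }
    # Assumed format: keyword words, then one numeric score at the end.
    if ( $line =~ /^\s*(.+?)\s+(-?\d+(?:\.\d+)?)\s*$/ ) {
        $scores{ lc $1 } = $2;             # keywords are stored in lower case
    }
}

print "Feed: $_\n" for @feeds;
print "$_ => $scores{$_}\n" for sort keys %scores;
```

Multi-word keywords such as “supernovae type Ia” end up as a single lower-case hash key, which is why the matching in the script has to be case-insensitive.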
The astro-ph filter config file could serve another purpose …
The phrase “include a statement of research interests”, commonly found on post-doc job notices, could be replaced with “include your astro-ph filter config file, normalised to 100.”
“Include a statement of disuninterests as well,” might be an apt request. I can see many an interview progressing: “So Mr James, you say you like everything. Well, we have this post-doc on jets from magnetic solar braneworlds. No? Really? How narrow your interests must be.”
Personally, I think of my research interests as merely well-collimated.
Magnetic solar braneworlds?
Cooooool.
Berian: I think you mean “statement of uninterests”. We should be disinterested in all scientific topics, otherwise nothing we say can be trusted.
Yours pedantically,
Grumpy Old Man
(Blame Andy Lawrence’s blog for this outburst)
Ugh. I shall have to turn over my pedant’s card if there are any more slip-ups.
I suppose one instinctively gravitates away from ‘uninterests’ because it hits the ear about as elegantly as a stiletto through plate glass.
This leads me to another important set of disclosures: the ‘statement of misinterests’, in my case extending to algebraic topology, computational geometry and most of modern philosophy. Such diversity is to be cherished only when one is making disparate seminal contributions subsequently reconciled in the posthumously published magnum opus; otherwise it is too easily mistaken for dithering.