Unpronounceable — why can't people give bioinformatics tools sensible names?

June 13, 2014 by Keith Bradnam

Okay, so many of you know that I have a bit of an issue with bioinformatics tools with names that are formed from very tenuous acronyms or initialisms. I've handed out many JABBA awards for cases of 'Just Another Bogus Bioinformatics Acronym'. But now there is another blight on the landscape of bioinformatics nomenclature…that of unpronounceable names.

If you develop bioinformatics tools, you would hopefully want to promote those tools to others. This could be in a formal publication, or at a conference presentation, or even over a cup of coffee with a colleague. In all of these situations, you would hope that the name of your bioinformatics tool would be memorable. One way of making it memorable is to make it pronounceable. Surely, that's not asking that much? And yet…

GO2MSIG, an automated GO based multi-species gene set generator for gene set enrichment analysis – This is not so hard to pronounce (go-to-em-sig), but it is a little awkward and not very memorable.
AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data — I guess this only has one obvious pronunciation (abs-see-en-seq), but again not particularly memorable.
QCGWAS: A flexible R package for automated quality control of genome-wide association results — This sort of works if you separate out the two commonly used initialisms (QC + GWAS), but maybe not everyone will spot this straight away (especially if you are not familiar with GWAS). I still find this a bit of a mouthful to say (cue-see-gee-was).
CMGRN: a web server for constructing multilevel gene regulatory networks using ChIP-seq and gene expression data — The lack of vowels means that it can only ever be pronounced by uttering every consonant separately (see-em-gee-ar-en).
iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition — I don't know where to start with this one! Imagine that you had to spell this out to a journalist over the phone (something that can happen in science!): "The software name? Yes, it's aye (lower-case), en (upper-case), you-see (lower-case), hyphen, pee (upper-case), ess-ee (lower-case), and kay-en-see (upper-case)…hello, are you still there?".
MFSPSSMpred: identifying short disorder-to-order binding regions in disordered proteins based on contextual local evolutionary conservation — Couldn't be simpler really. I look forward to telling my colleagues about em-eff-ess-pee-ess-ess-em-pred.
mRMRe: an R package for parallelized mRMR ensemble feature selection — This is not as long as some of the others, but try saying this five times fast (em-ar-em-ar-ee).
LoQAtE—Localization and Quantitation ATlas of the yeast proteomE. A new tool for multiparametric dissection of single-protein behavior in response to biological perturbations in yeast — I get the feeling that this is meant to be pronounced 'LOCATE', but that's only a guess. Maybe it's really pronounced low-queue-at-ee? It's clumsy, ugly, and also an incredibly tenuous initialism.
HoPaCI-DB: host-Pseudomonas and Coxiella interaction database — This, like many of the above entries, also featured as a JABBA award recipient. This is not as bad an acronym/initialism as others, but it ranks highly for its lack of obvious pronunciation. Is it ho-pa-cee-aye-dee-bee, hop-pah-cee-aye-dee-bee, ho-pa-sigh-dee-bee, or even ho-pack-ee-dee-bee???

There is a lot of bioinformatics software in this world. If you choose to add to this ever growing software catalog, then it will be in your interest to make your software easy to discover and easy to promote. For your own sake, and for the sake of any potential users of your software, I strongly urge you to ask yourself the following five questions:

Is the name memorable?
Does the name have one obvious pronunciation?
Could I easily spell the name out to a journalist over the phone?
Is the name of my database or tool free from any needless mixed capitalization?
Have I considered whether my software name is based on such a tenuous acronym or initialism that it will probably end up receiving a JABBA award?

101 questions with a bioinformatician #10: Lex Nederbragt

June 11, 2014 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

This is the third 'binary' post in this series — where the interviewee number consists of just ones and/or zeros. If this fact makes you excited, then you probably need to get out more.

Lex Nederbragt works as a Bioinformatician at the Norwegian Sequencing Centre (where they probably do more than just sequence Norwegians). He is also an Associate Professor at the Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo.

As a Dutchman living in the least populous of the three Scandinavian Kingdoms, Lex can take comfort in knowing that the Netherlands retain the upper hand in their battles with Norway on the football field.

Away from football — and this is the last chance you'll have to get away from football for the next few weeks — Lex is someone who posts fantastic amounts of useful information on his blog. If you have any interest in high-throughput sequencing and assembly, then you owe it to yourself to follow his blog updates.

You can find out more about Lex by following him on twitter (@lexnederbragt), or reading his aforementioned blog (In between lines of code) or his other blog…presumably the world's only blog devoted to the Newbler assembler.

And so on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

The increasing focus on reproducibility and reusability. Making sure others can reproduce your work is such a fundamental aspect of science, and computational work should be easy to reproduce in principle. It is fascinating to see how difficult this turns out to be in practice — even in cases where the description of the work is very complete.

010. What's something that you *don't* enjoy about current bioinformatics research?

I'm not the first one to complain about the seemingly unlimited growth in tools meant for the same job, e.g., short read mappers. My field of interest is de novo genome assembly, and there too new tools appear regularly. I think it is about time we settle on a set of tools that appear to be best suited for the job, and move on to finding ways to determine which tools work best for each individual dataset and research question. In the case of assembly, we basically already know the set of programs that generally perform well. Now we need to develop and implement evaluation tools that tell a researcher which assembly of the data is the best one for their purposes.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

I am a bit ambivalent here. It took me a long time to realize that I wanted to become a bioinformatician; I missed a lot of signals about how much I enjoyed programming, for example. So, I would like to tell myself to explore computational science much more than I did. On the other hand, waiting this long to make the switch to bioinformatics meant I have acquired a very firm background in biology. I find this essential for my work, as it allows me to make connections between the technological aspects of high-throughput sequencing experiments and data analysis, and the biological questions that inspired the experiments in the first place. So, I would also like to tell myself to keep on studying biology.

100. What's your all-time favorite piece of bioinformatics software, and why?

The Newbler assembly and mapping program from Roche/454 Life Sciences. It is not the program per se (it's good, but not necessarily the best; nor is it open source, for that matter). However, it is through the use of this program I was propelled into bioinformatics. I became very familiar with it and started scripting to massage its output. I even wrote a user-oriented manual for Newbler. These days, I use many more assembly programs besides Newbler, but my bioinformatics 'roots' will always be Newbler.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

B, as it stands for 'C or G or T', so it is flexible, allowing several alternatives and keeping options open. But it also means knowing your limits, not everything goes. I also like to have a 'plan B' in the back of my head.

How to make your genomics website more suitable for an English-speaking audience

June 05, 2014 by Keith Bradnam

Today I visited the website of the Beijing Institute of Genomics (BIG) for the first time. BIG is not to be confused with BGI (which was formerly known as the Beijing Genomics Institute). If you look at just about any web page on this site other than the home page (which contains an unusual visual element), you'll see the following image:

My sharp, British-born, eyes quickly recognized this as the UK's Houses of Parliament in London (well technically it's the Palace of Westminster). See this image for a comparison. I then noticed that this image doesn't feature on the Chinese language version of the website (which has a completely different design).

I can only assume that some web designer thought that an image like this would be fitting because it is the English-language version of the website, and that they therefore chose an image of something (incorrectly) deemed to be English. At this point, I feel obliged to share the following video which offers a definitive explanation as to the differences between England, Great Britain, and the United Kingdom:

Reflections on my '101 questions with a bioinformatician' series

June 04, 2014 by Keith Bradnam

This is in lieu of a regular '101 questions with a bioinformatician' post which has been delayed (hopefully by only a day). This series of interviews has now been running for over 2 months and — judging by my web stats — it seems to be popular. In fact, these posts now account for the majority of traffic to this site.

Thanks to everyone who has contributed so far, and for everyone who has been reading these interviews. It's been fun doing this and I've enjoyed seeing the variety of answers that people have provided.

I should confess that I'm solely responsible for adding hyperlinks to the answers that people provide, and in addition to adding links for obvious items like pieces of bioinformatics software, I sometimes like to have a bit of fun with what I choose to link to. E.g. see the links I added to question 101 in my interview with Holly Bik.

To finish off, here are some relevant numbers about this series:

10 — number of interviews posted
2 — number of interviews finished and (almost) ready to be posted
6 — number of people who have agreed to be interviewed but haven't yet sent me their answers (cough, cough).
81 — my current list of 'potential interviewees'

The last point means that hopefully I can keep this series going for a while longer. I guess that I now have to aim for an interviewee #101 (which would be the 102nd interview…obviously).

Still collecting results for my survey about gender bias in bioinformatics

May 30, 2014 by Keith Bradnam

A quick post just to say that although I published some preliminary results from my survey about gender bias in bioinformatics, I left the survey live so that others could still add their responses. So far, I've had 28 more responses on top of the original 370.

I also tweaked the survey form to allow ex-bioinformaticians to respond (and I asked whether they left bioinformatics as a career because of gender bias). If you haven't done so, please complete the form (embedded below and also available here). I'll try to update the main results on Figshare in a few weeks. Hopefully, with some more results it will be possible to see if there are other notable patterns in the results.

101 questions with a bioinformatician #9: Tuuli Lappalainen

May 28, 2014 by Keith Bradnam

Tuuli Lappalainen is a Group leader at the New York Genome Center, an institution that's so new, that their Illumina HiSeq X Ten is counted as one of their older sequencing machines. In addition to having possibly the coolest logo for a genomics/bioinformatics institute, they also have an impressive set of green credentials. And did I mention that it's in New York, New York? Start spreading the newwwss…

Sorry, I got distracted.

Tuuli is also an assistant professor at the Department of Systems Biology at Columbia University. Her work focuses on using high-throughput sequencing data to study functional genetic variation in human populations. Her website — paraphrasing Dobzhansky — puts it like this:

Nothing in the genome makes sense except in the light of the transcriptome

You can find out more about Tuuli by following her on twitter (@tuuliel) or by checking out her lab's website. Oh, and Tuuli is looking for a talented post-doc to join her lab (she didn't ask me to say that, it's all part of the service). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I have very little interest in methods for the sake of methods; for me it's all about understanding biology, and bioinformatics provides fantastic opportunities for that.

010. What's something that you *don't* enjoy about current bioinformatics research?

The working environment that is local when data and analyses are increasingly global is driving me insane. I've done (and still do) a lot of consortium work, where all of us still end up copying large data files to our local servers, and having locally optimized pipelines and scripts that are impossible to transfer to colleagues. I know that many people are trying to solve the problem, and I hope we'll be able to make it happen soon. And then there are the complications of applying for and getting access to various datasets. Privacy concerns are important, but does dbGaP really need to be so difficult to use? Our open access data set from GEUVADIS (Genetic European Variation in Health and Disease) is a great exception to this.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Learn more stats, math, proper programming. It's great to see how the younger generations have formal training in so many of the skills that I've had to just pick up along the way — I'm a biologist by training and proud of it, but in the early 2000s computational biology was still very marginal.

100. What's your all-time favorite piece of bioinformatics software, and why?

My two current favorites are pysam for handling BAM/SAM files — fast, great syntax, and much more versatile than alternatives — and Matrix eQTL for very fast eQTL analysis.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

T for Tuuli!

Is the naming of bioinformatics software getting out of CoNtRol?

May 22, 2014 by Keith Bradnam

There is a new paper in the journal Bioinformatics. This is the title of that paper:

CoNtRol: an open source framework for the analysis of chemical reaction networks

Now people will know that I have no stomach for bogus bioinformatics acronyms and initialisms, so is CoNtRol worthy of a JABBA award? Well I can't give it such an award because CoNtRol is not an acronym or an initialism. At least I don't think it is.

The abstract describes CoNtRol as a web-based framework for analysis of chemical reaction networks (CRNs). So even though the capitalized letters in CoNtRol give you CNR, maybe it's really all about CRNs???

The CoNtRol website makes things a little more confusing by starting their introduction with the text: CoNtRol (CRN tool) is a web application. Are you now thinking what I'm thinking? Is CoNtRol the world's first bioinformatics software based on an anagram (CoNtRol = CRN tool)? If this isn't the reason, then I can only assume that someone decided to just randomly capitalize various letters in the name.
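Both observations are easy to check mechanically. The following is just a minimal Python sketch; the helper name `letter_counts` is made up for illustration. It pulls out the capitalized letters of the name and then compares the letters of the two strings, ignoring case and spaces:

```python
from collections import Counter


def letter_counts(text):
    """Count the letters in a string, ignoring case, spaces and punctuation."""
    return Counter(ch.lower() for ch in text if ch.isalpha())


name = "CoNtRol"

# The capitalized letters, in the order they appear in the name
capitals = "".join(ch for ch in name if ch.isupper())
print(capitals)

# Two strings are anagrams if they use exactly the same letters
is_anagram = letter_counts(name) == letter_counts("CRN tool")
print(is_anagram)
```

The capitals come out as CNR rather than CRN, and the anagram test passes.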

Whatever the reason for the name, the more practical issue is that tools with names like this can often be hard to find with web search engines. It doesn't show up on the first page of Google results if you search for control bioinformatics web app. Nor does it show up if you search for control chemical network app. There is something to be said for giving software novel names.

101 questions with a bioinformatician #8: Nick Loman

May 21, 2014 by Keith Bradnam

Nick Loman is an Independent Research Fellow in the Institute of Microbiology and Infection at the University of Birmingham, UK. You may know Nick for his involvement in producing the only world map of high-throughput sequencers (at least I'm assuming that this is the only map of its kind…I'm too lazy to check). Maybe you know him for the exclusive interview that he managed to secure with some of Oxford Nanopore's head honchos at the 2012 AGBT meeting (the scene of a certain wowser moment in high-throughput sequencing). Or maybe you just know Nick for his epicurean passions.

I like to think of Nick as the Jack of Clubs in the deck of cards that is the bioinformatics blogging community (this works as a metaphor, right?). Actually, on some days he's more like the Ten of Diamonds, but then he goes and writes great pieces like this (co-authored with fellow 101 alumnus Mick Watson):

So you want to be a computational biologist? Nature Biotechnology, 2013

If you are interested in bioinformatics, and if you want to keep up with the latest developments in high-throughput sequencing technology, then you really should be keeping a close eye on people like Nick (though not too close, give the man some privacy!).

You can find out more about Nick by following him on twitter (@pathogenomenick) or keeping up with his excellent blog (Pathogens: Genes and Genomes). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I mainly enjoy the daily battles with crashing servers with cryptic memory errors, incompatible software versions, buggy scripts (mine and others) and full hard drives.

Hah! That was the famous British sarcasm you will have read about.

The obvious answer is that the projects I get involved in are incredibly diverse, and I get to interact with many interesting people, because sequencing and bioinformatics skills are in such demand.

Another thing I enjoy is that I can reach out, via Twitter and blogging, to discuss with all the great computational biologists in the world struggling with the same problems. I have no idea what it must be like to feel isolated and slog away in a windowless laboratory without that kind of communication.

010. What's something that you *don't* enjoy about current bioinformatics research?

I whinge quite a lot on my Twitter feed, but I wish bioinformaticians (including myself) wouldn't spend so much time reinventing the wheel (Keith: it's bioinformatics sin number 1 on this list), and instead try and muck-in together to solve really important problems.

A model of bioinformatics research a bit more like the Linux kernel might work. Imagine an international network of committed bioinformaticians working together. We would achieve great things quickly. But the academic model of recognition is broken for things like this, where everyone needs their own papers to justify their positions.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

I guess I would have got into the details of Bayesian statistics and machine learning earlier. These skills are very useful and I am only picking them up properly just now (I am on a Medical Research Council Training Fellowship).

Probably would have slipped myself a copy of Grays Sports Almanac too.

More prosaic: GNU parallel I discovered way too late and is an essential tool. And screen.

100. What's your all-time favorite piece of bioinformatics software, and why?

There's very little you can't get done with BLAST. It has its funny little quirks, but you know where you are with it.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

Well, it would be rather British to suggest T. But I prefer coffee.

Developing CEGMA: how working on old code can drive you mad and some tips on how to avoid this

May 20, 2014 by Keith Bradnam

Today marks the day when the original paper that describes the CEGMA software (Core Eukaryotic Gene Mapping Approach) becomes my most cited paper (as tracked by Google Scholar):

Does this fact make me happy? Not really. In fact, you may be surprised to learn that I find working on CEGMA a little bit depressing. I say this on a day when, purely coincidentally, I am releasing a new version of CEGMA. Why the grumpy face Keith? (I hear you ask). Let's take a trip down memory lane to find out why:

Early 2004: A paper is published that describes the KOGs database of euKaryotic Orthologous Groups.
Early 2005: I become the first person to join the Korf Lab after Ian Korf moves to Davis in 2004.
Mid 2005: Genís Parra becomes the second person to join the lab.
2005–2006: The three of us work on the idea which became CEGMA. This project was primarily driven forward by Genís; during this time our initial CEGMA manuscript was rejected by two journals.
Late 2006: Our CEGMA paper was accepted!
Early 2007: CEGMA paper is published — as an aside, the URL for CEGMA that we include in the paper still works!
2007: We work on the CEGMA spin-off idea: that it can be used to assess the 'gene space' of draft genomes.
2008: Write new manuscript, get rejected twice (again), finally get accepted late 2008.
Early 2009: The 2nd CEGMA paper gets published!
Mid 2010: Genís leaves the lab.

By the time Genís left Davis, our original CEGMA paper had been cited 11 times (one of which was by our second CEGMA paper). I think that we had all expected the tool to have been a little more popular, but our expectations had been dampened somewhat by the difficulties in getting the paper published. Anyway, no sooner had Genís left the lab than the paper started getting a lot more attention:

Growth in citations to the two CEGMA papers.

This was no doubt due to its use as a tool in the Assemblathon 1 paper (in which I was also involved), a project that started in late 2010. However, any interest generated from the Assemblathon project probably just reflected the fact that everyone and their dog had started sequencing genomes and producing — how best to describe them? — 'assemblies of questionable quality'.

This is also about the time when I started to turn into this guy:

This was because it had fallen on me to continue to deal with all CEGMA-related support requests. Until 2010, there hadn't really been any support requests because almost no-one was using CEGMA. This changed dramatically and I started to receive lots of emails that:

Asked questions about interpreting CEGMA output
Reported bugs
Asked for help installing CEGMA
Suggested new features
Asked me to run CEGMA for them

I started receiving lots of requests of that last kind because CEGMA is admittedly a bit of a pig to install (on non-Mac-based Unix systems at least). In the last 6 months alone, I've run CEGMA 80 times for various researchers who (presumably) are unable to install it themselves.

After the version 2.3 release — necessary to transition to the use of NCBI BLAST+ instead of WU-BLAST — and 2.4 release — necessary to fix the bugs I introduced in v2.3! — I swore an oath never to update CEGMA again. This was mostly because we no longer have any money to work on the current version of CEGMA. However, it was also because it is not much fun to spend your days working on code that you barely understand.

It should be said that we do have plans for a completely new version of CEGMA that will — subject to our grant proposal being successful — be redeveloped from the ground up, and will include many completely new features. Perhaps most importantly — for me at least — a version 3.0 release of CEGMA will be much more maintainable.

And now we get to the main source of my ire when dealing with CEGMA. It is built on a complex web of Perl scripts and modules, which make various system calls to run BLAST, genewise, geneid, and hmmsearch (from HMMER). I still find the scripts difficult to understand — I didn't write any of the original code — and therefore I find it almost impossible to maintain. One of the reasons I had to make this v2.5 update is because the latest versions of Perl have deprecated a particular feature causing CEGMA to break for some people.

Most fundamentally, the biggest problem with CEGMA (v2.x) is that it is centered around use of the KOGs database, a resource that is now over a decade old. This wasn't an issue when we were developing the software in 2005, but it is an issue now. Our plans for CEGMA v3.0 will address this by moving to a much more modern source of orthologous group information.

In making this final update to v2.x of CEGMA, I've tried adopting some changes to bring us up to date with the modern age. Although the code remains available from our lab's website, I've also pushed the code to GitHub (which wasn't in existence when we started developing CEGMA!). In doing this, I've also taken the step to give our repository a DOI and therefore make the latest version citable in its own right. This is done through use of Zenodo.

Although I hope that this is the last thing that I ever have to write about CEGMA v2.x, it is worth reflecting on some of the ways that the process of managing and maintaining CEGMA could have been made easier:

Maintain documentation for your code that is more than just an installation guide and a set of embedded comments. From time to time, I've had some help from Genís in understanding how the code is working, but the complexity of this software really requires a detailed document that explains how and why everything works the way it does. There have been times when I have been unable to help people with CEGMA-related questions because I still can't understand what some of the code is doing.
Start a FAQ file from day one. This is something that, foolishly, I have only recently started. I could have probably saved myself many hours of email-related support if I had sorted this out earlier.
Put your code online for others to contribute to. Although GitHub wasn't around when we started CEGMA, I could have put the code up there at some point before today!
Don't assume that people will use a mailing list for support, or even contact you directly. One thing I did do many years ago, is set up a CEGMA mailing list. However, I'm still surprised that many people just report their CEGMA problems on sites like SEQanswers or BioStars. I probably should have started checking these sites earlier.
Don't underestimate how much time can be spent supporting software! I probably should have started setting aside a fixed portion of time each week to deal with CEGMA-related issues, rather than trying to tackle things as and when they landed on my doorstep.
Assume that you will not be the last person to manage a piece of software. There are many things you can do to start good practices very early on, including using email addresses for support which are not tied to a personal account, ensuring that your changes to the code base have meaningful (and helpful) commit messages, and making sure that more than one person has access to wherever the code is going to end up.

In some ways it is very unusual for software to have this type of popularity where people only start using it several years after it is originally developed. But as CEGMA shows, it can happen, and hopefully these notes will serve as a bit of a warning to others who are developing bioinformatics software.

Fun with an 'error message' from NCBI BLAST+

May 16, 2014 by Keith Bradnam

Consider this very simple DNA sequence in FASTA format:

>sequence1
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

If you try converting this to a BLAST database using the 'makeblastdb' command from the latest version of NCBI's BLAST+ suite, you will see the following line included in the output:

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)

Now consider what happens if you run the same makeblastdb command on this sequence (which just switches the first two lines of sequence1):

>sequence2
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

Although this sequence has the exact same proportion of As, Cs, Gs, Ts, and Ns, it does not produce the error message. What about the following sequence?

>sequence3
nnnac
ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta
tttcagaaaatagctaaatgtgatttctgtaattttgcctgccaaattcg
tgaaatgcaataaaaatctaatatccctcatcagtgcgatttccgaatca
gtatatttttacgtaatagcttctttgacatcaataagtatttgcctata
tgactttagacttgaaattggctattaatgccaatttcatgatatctagc
cactttagtataattgtttttagtttttggcaaaactattgtctaaacag

Well, surprise surprise, this sequence produces the error message again (though it now tells you that the first line 'is about 60% ambiguous nucleotides'). You will still see the same message even if you add 1 billion As, Cs, Gs, and Ts on to the end of sequence3. This seems to be the code responsible for the error message (taken from this page):

In case it wasn't obvious, here is why this annoys me:

The comment in the code indicates that this should be treated as a warning (less serious), but then the message starts with a prefix of 'Error' (more serious). So it's a warning and an error?
It only considers the first line of sequence data. I appreciate that this is the easiest thing to check, but it is not very useful if all of your ambiguous bases are at the end of the sequence (or anywhere past the first line).
What is the rationale for choosing 40% as the threshold for warning? It seems a little too arbitrary.
It produces this warning if the first line is at least 40% ambiguous and also has a length greater than 3 bp! This means that it can be triggered with a line that starts 'NNNAC' as in my sequence3 example above.
It considers all ambiguity codes as being equal. So if I switched my first line of sequence3 from NNNAC to RWBAC, it would still trigger the same message even though the sequence RWB contains much more information than NNN.
The way the output text bluntly says 'shouldn't be over 40%' comes across as very matter-of-fact, as if you've transgressed some unknown law of bioinformatics.
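To make the behaviour described in this list concrete, here is a minimal Python sketch of a first-line check. This is not NCBI's code: the function names are made up, the rule that anything other than A, C, G or T counts as ambiguous is my assumption, and so is the exact comparison against the 40% threshold (I have used 'over 40%' to match the wording of the message).

```python
# A sketch of the check as described in this post; NOT the NCBI implementation.
UNAMBIGUOUS = set("ACGT")       # assumption: every other character is 'ambiguous'
MIN_LENGTH = 3                  # the line must be longer than this (in bp)
MAX_PERCENT_AMBIGUOUS = 40      # threshold quoted in the message


def percent_ambiguous(line):
    """Return the percentage of ambiguous characters on one line of sequence."""
    bases = line.strip().upper()
    if not bases:
        return 0.0
    ambiguous = sum(1 for base in bases if base not in UNAMBIGUOUS)
    return 100.0 * ambiguous / len(bases)


def would_trigger_message(first_line):
    """Only the first data line is inspected, no matter how long the sequence is."""
    bases = first_line.strip()
    return (len(bases) > MIN_LENGTH
            and percent_ambiguous(bases) > MAX_PERCENT_AMBIGUOUS)


first_lines = {
    "sequence1": "n" * 50,
    "sequence2": "ttttttagaaaaattatttttaagaatttttcattttaggaatattgtta",
    "sequence3": "nnnac",
    "RWBAC": "RWBAC",
}
for name, line in first_lines.items():
    print(name, percent_ambiguous(line), would_trigger_message(line))
```

Under these assumptions sequence1, sequence3 and the RWBAC variant are all flagged, while sequence2 is not, even though it contains just as many Ns as sequence1.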

So here are my suggestions for an alternative (which admittedly requires some more coding):

If a sequence is shorter than 1,000 bp, check all of the sequence; otherwise check the first 1,000 bp of sequence (if not more).
Report the output as a warning and not an error.
Change the warning message. E.g. 'The first 1,000 bp of your sequence contains a high proportion (X%) of ambiguous bases. Such sequences may not be very useful for any downstream analysis that you perform with BLAST+.'
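These suggestions could be sketched along the following lines in Python. Everything here is hypothetical: the function name, the default window of 1,000 bp taken from the first suggestion, and especially the 40% cut-off, which I have simply reused because nothing above proposes a better value for a 'high proportion'.

```python
WINDOW = 1000        # number of bp to inspect (the whole sequence if it is shorter)
THRESHOLD = 40.0     # hypothetical cut-off, reused from the existing message


def ambiguity_warning(sequence, window=WINDOW, threshold=THRESHOLD):
    """Return a warning string, or None if the inspected region looks fine."""
    region = sequence.upper()[:window]
    if not region:
        return None
    # Assumption: anything that is not A, C, G or T counts as ambiguous
    ambiguous = sum(1 for base in region if base not in "ACGT")
    percent = 100.0 * ambiguous / len(region)
    if percent <= threshold:
        return None
    return (
        "Warning: the first {:,} bp of your sequence contains a high "
        "proportion ({:.0f}%) of ambiguous bases. Such sequences may not be "
        "very useful for any downstream analysis that you perform with BLAST+."
    ).format(len(region), percent)


# A short run of Ns at the start of a longer sequence no longer triggers anything
print(ambiguity_warning("nnnac" + "acgt" * 100))

# A sequence whose first 1,000 bp are mostly Ns does
print(ambiguity_warning("n" * 600 + "a" * 1400))
```

This still misses ambiguous bases that fall outside the inspected window, which is why the first suggestion says 'if not more'.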