Mathematics: Newspaper Readability

Introduction

This is an investigation into the readability of articles in certain types of newspaper: broad sheet, tabloid and local newspapers. I use the readability scales that Microsoft Word reports after a grammar check.

I have used over 60 articles from newspapers in the US and UK. The Flesch scale is given because it is a sound scale of readability: the lower the percentage, the more experienced the reader needs to be to understand the article, while a high percentage means that basic, short words are used frequently, making the article understandable by a large number of people. Tabloids are noticeably easier to read, and broad sheets are aimed at people with higher language skills.

This is taken from the help in Microsoft Word on their grammar checking formulae:

The Flesch Index computes readability based on the average number of syllables per word and the average number of words per sentence. Scores range from 0% to 100%. The average writing score is approximately 60% to 70%. The higher the score, the greater the number of people who can readily understand the document.

Help in Microsoft Word ‘95

Data

Newspaper type | Name of newspaper | Article name | Flesch
Broad Sheet | BBC News | Hong Kong: UK – Dull But Tradeworthy | 54.4%
Broad Sheet | BBC News | Pope Makes Jewish-Born Nun Saint | 47.6%
Tabloid | Chicago Tribune | Stopgap Action Prevents Shutdown | 43.9%
Tabloid | Cornishman | Intimidation and Violence in Penwith | 64.3%
Tabloid | Cornishman | Mother Slams Drug Dealers | 61.1%
Broad Sheet | Guardian | Twins Separated by No-Man’s Land | 60.8%
Broad Sheet | Guardian | Bonds Plunge Fuels Turmoil | 55.2%
Broad Sheet | Guardian | Pinochet Arrested In London | 40.6%
Broad Sheet | Herald | NATO waits For Milosevic | 36.9%
Broad Sheet | Herald | New Sunday Newspaper for Scotland | 51.7%
Broad Sheet | Independent | Genetic Crops May Be Banned | 44.6%
Broad Sheet | Independent | Doctors Leaving NHS | 56.3%
Broad Sheet | Independent | Keep the Red Flag Flying | 56.7%
Broad Sheet | Independent | Arts May Reap Millions | 52.5%
Broad Sheet | Independent | Alcopop on Sale Next to Sweets | 40.6%
Broad Sheet | Independent | Riots Rage in Hebron | 64.8%
Broad Sheet | Independent on Sunday | China Selects The Blair | 44.1%
Broad Sheet | London Evening Standard | Maxwell Faces Court Action | 54.1%
Tabloid | Mirror | Fight to The Bitter End | 55.1%
Tabloid | Mirror | Palace Must Match Our Good Sense | 66.3%
Tabloid | Mirror | £33m Fine for The Sugar Price Fixers | 65.2%
Tabloid | Mirror | Shearer For £18m | 64.2%
Tabloid | Mirror on Sunday | Clare’s Clanger | 69.2%
Tabloid | Mirror on Sunday | Secret Cancer Battle | 68.4%
Broad Sheet | New York Post | Starr May Be Called to Testify | 45.8%
Broad Sheet | Observer | Boosting Ranks of Black Police | 54.8%
Broad Sheet | Telegraph | £0.5m Damages for Doctor | 63.9%
Broad Sheet | Telegraph | Thatcher’s Bag Rests in Peace | 50.6%
Broad Sheet | Telegraph | Shaw on Sex, Plum Cake, Morality | 60.7%
Broad Sheet | Telegraph | Hollywood Wins Fight | 40.2%
Broad Sheet | Telegraph | Cheap Holiday? Take a Sheep | 74.1%
Broad Sheet | Telegraph | Playwright Of The Century | 48.6%
Broad Sheet | Telegraph | Housing Benefit Fraud unit Closed | 51.6%
Broad Sheet | Telegraph | Army’s Boycott of British Lamb | 62.1%
Broad Sheet | Telegraph | Pensioners Get Help with the Fuel Bill | 52.9%
Broad Sheet | Telegraph | God On a Web Site | 39.9%
Broad Sheet | Telegraph | Beef on the Bone Ban is to be lifted | 59.0%
Broad Sheet | Times | Serbia Gives Way to Avoid NATO | 35.5%
Broad Sheet | Times | Neill wants £20m Cap on Funds | 50.0%
Broad Sheet | Times | Foreigners Quit Belgrade | 33.8%
Broad Sheet | Times | Leftist May Lead Italy | 39.4%
Broad Sheet | Times | Unions Could Join Bosses on Euro | 40.8%
Broad Sheet | Times | Tory Plot to Sink Archer Bid for Mayor | 50.2%
Broad Sheet | Times | Balkans’ Secretive Sparring Partners | 43.3%
Tabloid | US Tabloid | Man Breaks Wind for 30 Years | 58.8%
Tabloid | US Tabloid | Contestant Kills Game Show Host | 77.0%
Tabloid | US Tabloid | Faldo’s Birdie Tees Off on Porche | 66.3%
Tabloid | US Tabloid | Alligator-Man Strikes | 71.0%
Tabloid | US Tabloid | Oil Rules Review Vowed | 59.8%
Tabloid | US Tabloid | Computer Viruses Infect Humans | 55.5%
Tabloid | US Tabloid | One Arm Bandits | 69.3%
Tabloid | US Tabloid | Prostate Tumor Removed | 71.0%
Tabloid | US Tabloid | Santa Cruz Eats Too Healthy | 75.6%
Tabloid | US Tabloid | Satan Joins the Meter Maids | 71.0%
Tabloid | US Tabloid | Wolf-Calls from Gay Workers | 60.3%
Tabloid | US Tabloid | Bell’s Weird Farewell | 58.8%
Tabloid | US Tabloid | Rich Girl: 10 Most Wanted | 61.0%
Tabloid | US Tabloid | Monica Talks | 77.3%
Tabloid | US Tabloid | Woman Eaten by Parking Lot | 72.5%
Tabloid | US Tabloid | Pop Guns | 74.8%
Tabloid | US Tabloid | Bulls Gallop Back | 60.9%
Tabloid | US Tabloid | Worst Fears Confirmed | 58.1%
Tabloid | US Tabloid | Grisly Killing Stuns Friends | 62.4%
Tabloid | US Tabloid | Delivery Three Weeks Early | 72.2%
Tabloid | US Tabloid | Seinfeld Steals Wife | 65.0%
Tabloid | US Tabloid | Dysfunctional Family Eats | 50.3%
Tabloid | US Tabloid | Stinky New Hair Gel | 64.%
Tabloid | US Tabloid | Naked Prof Calls It Art | 47.3%
Tabloid | Washington Post | Bomber and Boss at a Loss | 74.3%
Tabloid | Western Morning News | Crowds Are Given Rousing Send Off | 55.2%

None of the articles are stored as files, as they were only saved temporarily for the grammar check, and the article titles have been abbreviated. My categorising of tabloids and broad sheets may not be completely correct, as newspapers do not always state which type they are. Some newspapers don’t fall into either category, but I have placed them all into two groups so that the analysis has as few variables as possible.

Aim

The purpose of this investigation is to judge the readability of two types of newspapers: broad sheets and tabloids. I will collect at least 50 pieces of raw data for this investigation in order to answer some questions.

  • Which newspapers are read by a wider audience?
  • Does the difficulty of the language determine what type of people read the newspaper?

Newspapers with a low readability are aimed at people who can understand longer words and long sentences. Low readability newspaper articles usually contain words with many syllables.

High readability newspaper articles can be understood by a greater range of people with varying intellect. Their sentences are concise and use shorter words with few syllables, making them generally easier to comprehend. Article readability, I feel, is worthy of study as it will indicate what audience the paper is aimed at, and whether or not broad sheets clearly have a lower readability than tabloid newspapers, which could be aimed at a greater population. It will be interesting to understand how certain people like to read certain types of newspaper.

Method

Newspaper articles will be taken from a search of newspapers on the internet. In each newspaper, the headline articles will be copied and analysed with the grammar check in MS Word. Results from the grammar check will be typed into a database, where the most representative readability formula will be chosen as the numeric variable for this investigation.

The search engine Yahoo can be instructed to find a tabloid or broad sheet newspaper on the internet at random, which makes the selection of newspaper sites random. Yahoo categorises web sites by the material they contain; broad sheets and tabloids are found under two different categories, which makes finding each type of article easier. An equal number of articles of each type will be used: 35 for each type of newspaper, making a total of 70 articles to be analysed for their readability. I will take the first two articles found on each online newspaper; articles repeating subject matter already collected will not be included, to ensure a greater variation of data.

The parent population from which I collect the data is undefined, as there is no way of finding how many articles come from either broad sheet or tabloid newspapers. For this investigation, the parent population is taken to be the newspaper articles from the UK or US that are published on the internet.

Processing the data

There was a large population of articles for the broad sheet category, as many UK national papers have their own web site where the paper can be read free of charge. Tabloids were scarcer, and I had to use US tabloids in order to have a large enough population to choose from. The articles were copied to MS Word, where they were grammar checked; a large number of the articles used are included in this investigation.

Data was stored and sorted using a database table in MS Access. The grammar check of each article produced a great amount of data, the vast majority of which was unnecessary as it did not convey readability or the paper type. I have chosen the Flesch readability scale as it incorporates all the points mentioned in the Aim (word length, sentence length and syllables per word). This numeric data is scored out of 100, but all the data lies in the centre of the possible range, from 33.8 to 77.3. Here is an explanation of each of the readability formulas reported after a grammar check:

Flesch Reading Ease

This index computes readability based on the average number of syllables per word and the average number of words per sentence. Scores range from 0 to 100. The average writing score is approximately 60 to 70. The higher the score, the greater the number of people who can readily understand the document.
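As a rough illustration of how such a score depends on the two averages, here is a minimal sketch using the standard published Flesch Reading Ease formula. It is an assumption that Word applies exactly this formula and counts syllables the same way; the function name and the word, sentence and syllable counts are hypothetical and are not taken from any of the articles.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard published Flesch Reading Ease formula, computed from raw counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for two 100-word passages (illustrative only).
plain = flesch_reading_ease(total_words=100, total_sentences=8, total_syllables=130)
dense = flesch_reading_ease(total_words=100, total_sentences=4, total_syllables=170)
print(round(plain, 1), round(dense, 1))
```

Shorter sentences and fewer syllables per word both push the score up, which is why the first passage scores higher than the second.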

Flesch-Kincaid Grade Level

This index computes readability based on the average number of syllables per word and the average number of words per sentence. The score in this case indicates a US grade-school level. For example, a score of 8.0 means that an eighth grader would understand the document. Standard writing approximately equates to the seventh-to-eighth-grade level.
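The grade level uses the same two averages with different weights. The sketch below uses the standard published Flesch-Kincaid formula; as before, it is an assumption that Word implements it in exactly this form, and the function name and counts are hypothetical.

```python
def flesch_kincaid_grade(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard published Flesch-Kincaid Grade Level formula, computed from raw counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts for two 100-word passages (illustrative only).
easy_grade = flesch_kincaid_grade(total_words=100, total_sentences=8, total_syllables=130)
hard_grade = flesch_kincaid_grade(total_words=100, total_sentences=4, total_syllables=170)
print(round(easy_grade, 1), round(hard_grade, 1))
```

Here the direction is reversed compared with the reading ease: longer sentences and longer words give a higher grade.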

Coleman-Liau Grade Level

This index determines a readability grade level based on characters per word and words per sentence.

Bormuth Grade Level

This index also determines a readability grade level based on characters per word and words per sentence.

The reading ease of an article is the measure I want to use in comparing broad sheet and tabloid newspapers, so the Flesch Reading Ease will be used. Here is a table of 35 articles for each type of newspaper, with their Flesch reading ease from lowest to highest in each category. This table will be used as a source for displays and analysis.

From the table, a frequency table was drawn up, with class widths of 10, in order to represent the data as a histogram. I have used class widths of 10 because outliers are more likely to be incorporated in the main distribution of the histogram. A stem and leaf diagram would not be a suitable method of displaying the Flesch reading ease, as there are decimals over a large range, and having a leaf for each integer would spread the data out too far for any meaningful analysis. If I were to round the data to the nearest integer and then plot a stem and leaf diagram, data values would be changed, possibly making findings less accurate. A histogram shows the spread of grouped data, and decimals have no effect on the grouping.

The frequency density is the frequency divided by the class width (evenly spaced group sizes of 10).

Group | Frequency | x midpoint | Freq. density
30 ≤ Φ < 40 | 5 | 35 | 0.5
40 ≤ Φ < 50 | 10 | 45 | 1
50 ≤ Φ < 60 | 14 | 55 | 1.4
60 ≤ Φ < 70 | 5 | 65 | 0.5
70 ≤ Φ < 80 | 1 | 75 | 0.1
Broad sheet frequency density

Group | Frequency | x midpoint | Freq. density
30 ≤ Φ < 40 | 0 | 35 | 0
40 ≤ Φ < 50 | 2 | 45 | 0.2
50 ≤ Φ < 60 | 8 | 55 | 0.8
60 ≤ Φ < 70 | 15 | 65 | 1.5
70 ≤ Φ < 80 | 10 | 75 | 1
Tabloid frequency density
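The grouping can be checked with a short script. This is only a sketch: the scores are the 35 broad sheet values from the data table, while the function name and the 30 to 80 range are choices made for this illustration.

```python
# Flesch scores of the 35 broad sheet articles, copied from the data table.
BROAD_SHEET = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]


def frequency_table(scores, start=30, stop=80, width=10):
    """Return (lower, upper, frequency, midpoint, frequency density) for each class."""
    rows = []
    for lower in range(start, stop, width):
        upper = lower + width
        frequency = sum(1 for s in scores if lower <= s < upper)
        rows.append((lower, upper, frequency, lower + width / 2, frequency / width))
    return rows


for lower, upper, freq, mid, density in frequency_table(BROAD_SHEET):
    print(f"{lower} <= x < {upper}: frequency {freq}, midpoint {mid:g}, density {density:g}")
```

The frequencies it prints (5, 10, 14, 5, 1) agree with the broad sheet table above.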

There appears to be a uni-modal distribution in both sets of data: there is only one peak in each, and the frequency falls either side of the modal group. From the frequency table the following histograms have been produced; the y axis shows the frequency density and the x axis is the Flesch reading ease in groups of 10. Both histograms are identical in size and scale on both axes, and are placed opposite each other for easy comparison.

[Figure: Broad sheet Flesch histogram]
[Figure: Tabloid Flesch histogram]

Analysis

The broad sheets histogram is positively skewed while the tabloids histogram is negatively skewed, and both are uni-modal. The modal class of the broad sheets is 50 ≤ Φ < 60 and that of the tabloids is 60 ≤ Φ < 70. There are no outliers shown on either histogram. The histograms are quite evidently different, suggesting that the two types of paper are aimed at different reading abilities. Tabloids, from looking at the histograms, are generally easier to read, while broad sheets are aimed mainly at people who can understand more complicated words and phrases.

The standard deviation of the article readability for each type of newspaper is useful for analysing the spread of the data and deciding whether broad sheets are more consistently hard to read than tabloid newspapers are easy to read. Standard deviation shows the average spread of the data from the mean. I have used class widths of 5 to calculate the standard deviation, as this is more accurate for this purpose than the larger groups of 10 used for the histograms to show the shape of the distribution. The mean of broad sheets = 50.36 and tabloids = 64.5.

Broad sheet standard deviation = sqrt(Σfx²/Σf – x̄²)
sqrt(2619.12 – 2536.13) = 9.11

Tabloid standard deviation = sqrt(Σfx²/Σf – x̄²)
sqrt(4226.25 – 4160.25) = 8.12

The standard deviations for the two types of newspaper are quite similar, but the spread of the broad sheets’ readability is slightly greater than the tabloids’. Therefore, the tabloids are more consistently easy to read than the broad sheets are consistently hard to read.
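The grouped calculation for the broad sheets can be sketched as follows. The class tallies below were counted from the data table for this illustration using class widths of 5 (30 to 35, 35 to 40, and so on); they are an assumption about how the grouping was done, since the original working is not shown.

```python
from math import sqrt


def grouped_mean_and_sd(midpoints, frequencies):
    """Mean and standard deviation of grouped data: sd = sqrt(mean of squares - square of mean)."""
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    mean_of_squares = sum(f * x * x for x, f in zip(midpoints, frequencies)) / n
    return mean, sqrt(mean_of_squares - mean ** 2)


# Broad sheet scores tallied into classes of width 5 (assumed grouping, see lead-in).
MIDPOINTS = [32.5, 37.5, 42.5, 47.5, 52.5, 57.5, 62.5, 67.5, 72.5]
FREQUENCIES = [1, 4, 7, 3, 10, 4, 5, 0, 1]

mean, sd = grouped_mean_and_sd(MIDPOINTS, FREQUENCIES)
print(f"mean = {mean:.2f}, standard deviation = {sd:.2f}")
```

Under this grouping the mean comes out at 50.36 and the standard deviation at about 9.1; any small difference in the second decimal place from the figure quoted above would come from rounding the mean before squaring it.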

As both sets of data are skewed (the broad sheets positively and the tabloids negatively), it would be appropriate to find the median and interquartile range. The median is a good representative measure of the data as it is not affected by outliers (in this case uncommon values of readability ease), and I am looking for the typical value for each newspaper type. The process of calculating the median for the readability ease of broad sheets and tabloids is shown below…

Broad sheet median position = ½(n + 1)
½ × 35 + ½ = 18th data value

Tabloid median position = ½(n + 1)
½ × 35 + ½ = 18th data value
When n is odd; data sorted in ascending order.

Number | Article name | Flesch
18 | Thatcher’s Bag Rests in Peace | 50.6%
Broad sheet median

Number | Article name | Flesch
18 | Intimidation and Violence in Penwith | 64.3%
Tabloid median
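The median lookup for the broad sheets can be reproduced in a few lines. This is a sketch only: the scores are the 35 broad sheet values from the data table and the variable names are illustrative.

```python
# Flesch scores of the 35 broad sheet articles, copied from the data table.
BROAD_SHEET = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]

ordered = sorted(BROAD_SHEET)          # data sorted in ascending order
n = len(ordered)                       # 35 articles, so n is odd
position = (n + 1) // 2                # 1-based position of the middle value
median = ordered[position - 1]         # Python lists are 0-based
print(position, median)
```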

The interquartile range is also useful: in this instance it represents only the data found in the central half of the whole range, and shows the difference between a representative low and high value. The process of calculating the interquartile range for the readability ease of broad sheets and tabloids is shown below…

Broad sheet interquartile range = Q3 – Q1
(¾n + ½) – (¼n + ½)
(¾ × 35 + ½) – (¼ × 35 + ½)
27th – 9th data value

Tabloid interquartile range = Q3 – Q1
(¾n + ½) – (¼n + ½)
(¾ × 35 + ½) – (¼ × 35 + ½)
27th – 9th data value
Positions 26.75 and 9.25 are rounded to the nearest whole data value; data sorted in ascending order.

Number | Article name | Flesch
9 | Unions Could Join Bosses on Euro | 40.8%
27 | Doctors Leaving NHS | 56.3%
Broad sheet interquartile range

Number | Article name | Flesch
9 | Bell’s Weird Farewell | 58.8%
27 | Prostate Tumor Removed | 71.0%
Tabloid interquartile range
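The quartile positions and the resulting interquartile range for the broad sheets can be checked with a short sketch. It follows the position rule used in this investigation (¼n + ½ and ¾n + ½, rounded to the nearest data value); other quartile conventions, such as those in statistics packages, can give slightly different answers. The variable names are illustrative.

```python
# Flesch scores of the 35 broad sheet articles, copied from the data table.
BROAD_SHEET = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]

ordered = sorted(BROAD_SHEET)
n = len(ordered)

q1_position = round(n / 4 + 0.5)       # 9.25 rounds to the 9th data value
q3_position = round(3 * n / 4 + 0.5)   # 26.75 rounds to the 27th data value
q1 = ordered[q1_position - 1]          # convert 1-based position to 0-based index
q3 = ordered[q3_position - 1]
iqr = q3 - q1
print(q1_position, q3_position, q1, q3, f"{iqr:.1f}")
```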

The interquartile range shows similar results to the standard deviation. The range between the typical high and low values is greater for the broad sheets (56.3 – 40.8 = 15.5) than for the tabloids (71.0 – 58.8 = 12.2), showing that the tabloid data is more densely packed near the median. The lower readability of broad sheets is less consistent than the higher readability of the tabloids, as shown earlier by the standard deviation.

Interpretation

There is a very clear difference between broad sheet and tabloid newspapers. The broad sheets are much more difficult to read, as longer sentences are used in the articles. Tabloid articles use less complicated language, resulting in a consistently higher Flesch readability ease; this is shown well by the two side-by-side histograms of Flesch readability ease. I anticipated this result as I have read many articles from both types of newspaper. From my own experience I have found tabloids quite easy to understand, as the facts and views in the article are put forward simply, using short words and sentences. Broad sheet newspaper articles I usually find more difficult to read, but there is often much more information packed into one sentence.

The standard deviation is one measure of which type of newspaper is aimed at a wider audience; the results are similar for the broad sheets (9.11) and the tabloids (8.12). On this measure, broad sheets are aimed at a slightly wider audience, as their readability has a greater average spread from the mean. Looking at the real-life application of the standard deviation, one must consider the percentage of the population which would have little trouble in reading lower readability newspapers. I do not have these figures, so I cannot conclude which type of paper is read by a greater population. I would assume tabloids, as they can be easily understood both by readers who only understand shorter words and less complicated sentences and by competent readers. Broad sheets, which are usually harder to understand, will normally be read only by readers who understand longer words and sentences.

The readability of the newspaper does appear to determine who reads the paper. Broad sheets are probably read by people who can easily understand low readability articles, while tabloids are often read by people who find low readability articles difficult and prefer the high readability articles found in tabloid newspapers. I believe this data was worth collecting, even if it took a long time to obtain from the internet, grammar check and sort in a database. It is essential in showing how broad sheet and tabloid newspapers have different readability levels. I believe that this data is a good representation of the parent population of newspapers in the UK and US, as I have obtained Flesch readability ease data from a total of 70 individual newspaper articles. 70 pieces of data should be enough to show patterns from which a reasonable conclusion can be drawn.

Evaluation

Looking at the table of results I found one outlier article in the broad sheets’ readability: “Cheap Holiday? Take a Sheep”. It scored almost 10 more on the Flesch scale than the previous article (sorted in ascending order). This outlier was hidden when the data was grouped into class widths of 10. On reading the article I discovered that it was written as a humorous, tabloid-style article, as denoted by the heading. As I have a reasonable amount of data, the effect of this outlier on the sample is reduced. There was no reason to remove the data; it is valid as it was found in a broad sheet newspaper.

There would have been a bias in the data due to the material available on the internet. I found that collecting broad sheet newspaper articles was not too difficult, as there were many broad sheet papers in the category in Yahoo! (internet search engine). Finding tabloid articles was very difficult, as there are very few in the UK and only a small list available in the US. Broad sheet newspapers are very up-to-date with internet technology; almost every broad sheet sold in the UK has its own web site. Tabloids are less keen to build web sites and to keep updating them as their newspapers are published. Besides The Mirror and local tabloids, there was a very short list for Yahoo! to choose from when making random searches, resulting in repetitions of search results. When I included US tabloids there were just enough tabloids available to retrieve a varied sample of articles from different newspapers.

I must consider the effect of the US newspaper articles included in this investigation, as US English is different from UK English and articles are written in a different style. Fortunately, Flesch readability ease is largely unbiased by such language differences, as it only takes into consideration syllables per word and the average length of sentences, which are almost unaffected by word order, spelling etc. If I wanted to extend this work I would manually copy and grammar check more newspaper articles, trying to obtain an equal number of US newspapers for each newspaper type, eliminating the possibility of a US/UK English bias in the Flesch reading ease results. The method of finding articles on the internet is also biased, as papers that have no web site are excluded from the list. The actual selection of articles to include from a newspaper, once the web site is visited, is also far from random.

The method used for data collection in this investigation has been to take the first 2 articles whose subject matter hasn’t already been obtained elsewhere. An example of a random, almost unbiased method of data collection would be:

  • Do not use the internet, which only gives a limited set of data
  • Select all newspapers on a certain date
  • Make a numbered list of all articles in the newspapers
  • Use the random function on a calculator to select an equal sample of articles for each type of newspaper
  • Use only UK national newspapers
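The random selection step in the list above could be sketched as follows, with a computer's random number generator standing in for the calculator's random function. The numbered lists and their sizes (400 and 250 articles) are hypothetical placeholders, as are the function name and the seeds, which are only there to make the example repeatable.

```python
import random


def sample_articles(numbered_articles, sample_size, seed=None):
    """Pick sample_size distinct article numbers at random, without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(numbered_articles, sample_size))


# Hypothetical numbered lists of every article printed on the chosen date.
broad_sheet_articles = list(range(1, 401))
tabloid_articles = list(range(1, 251))

# An equal sample of 35 articles for each type of newspaper.
broad_sample = sample_articles(broad_sheet_articles, 35, seed=1)
tabloid_sample = sample_articles(tabloid_articles, 35, seed=2)
print(len(broad_sample), len(tabloid_sample))
```

Because every numbered article has the same chance of being picked, the choice no longer depends on which article happens to appear first on a web site.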

This set-up would involve a massive amount of tedious listing of all the articles in the broad sheet and tabloid newspapers in the UK on a certain date, and then copying the randomly chosen articles for grammar checking. The parent population in such an investigation would be all the broad sheet and tabloid newspapers in the UK.

Comments

All comments are welcome on this 1999 maths project. I created it in the early days of the web, when few newspapers had a proper online presence.