Mathematics: Newspaper Readability

Introduction
Data
Aim
Method
Processing the data
Analysis
Interpretation
Evaluation
Comments

Introduction

This is an investigation into the readability of articles in certain types of newspapers: broad sheet, tabloid and local newspapers. I use the readability scales that Microsoft Word reports after a grammar check.

I have used over 60 articles from newspapers in the US and UK. The Flesch scale is used because it is a sound scale of readability: the lower the percentage, the more experienced the reader needs to be to understand the article, while a high percentage means that basic, short words are used frequently, making the article understandable to a large number of people. Tabloids are noticeably easier to read, and broad sheets are aimed at people with higher language skills.

This is taken from the help in Microsoft Word on their grammar checking formulae:

The Flesch Index computes readability based on the average number of syllables per word and the average number of words per sentence. Scores range from 0% to 100%. The average writing score is approximately 60% to 70%. The higher the score, the greater the number of people who can readily understand the document.
Help in Microsoft Word ‘95

Data

Newspaper type	Name of newspaper	Article name	Flesch
Broad Sheet	BBC News	Hong Kong: UK – Dull But Tradeworthy	54.4%
Broad Sheet	BBC News	Pope Makes Jewish-Born Nun Saint	47.6%
Tabloid	Chicago Tribune	Stopgap Action Prevents Shutdown	43.9%
Tabloid	Cornishman	Intimidation and Violence in Penwith	64.3%
Tabloid	Cornishman	Mother Slams Drug Dealers	61.1%
Broad Sheet	Guardian	Twins Separated by No-Man’s Land	60.8%
Broad Sheet	Guardian	Bonds Plunge Fuels Turmoil	55.2%
Broad Sheet	Guardian	Pinochet Arrested In London	40.6%
Broad Sheet	Herald	NATO waits For Milosevic	36.9%
Broad Sheet	Herald	New Sunday Newspaper for Scotland	51.7%
Broad Sheet	Independent	Genetic Crops May Be Banned	44.6%
Broad Sheet	Independent	Doctors Leaving NHS	56.3%
Broad Sheet	Independent	Keep the Red Flag Flying	56.7%
Broad Sheet	Independent	Arts May Reap Millions	52.5%
Broad Sheet	Independent	Alcopop on Sale Next to Sweets	40.6%
Broad Sheet	Independent	Riots Rage in Hebron	64.8%
Broad Sheet	Independent on Sunday	China Selects The Blair	44.1%
Broad Sheet	London Evening Standard	Maxwell Faces Court Action	54.1%
Tabloid	Mirror	Fight to The Bitter End	55.1%
Tabloid	Mirror	Palace Must Match Our Good Sense	66.3%
Tabloid	Mirror	£33m Fine for The Sugar Price Fixers	65.2%
Tabloid	Mirror	Shearer For £18m	64.2%
Tabloid	Mirror on Sunday	Clare’s Clanger	69.2%
Tabloid	Mirror on Sunday	Secret Cancer Battle	68.4%
Broad Sheet	New York Post	Starr May Be Called to Testify	45.8%
Broad Sheet	Observer	Boosting Ranks of Black Police	54.8%
Broad Sheet	Telegraph	£0.5m Damages for Doctor	63.9%
Broad Sheet	Telegraph	Thatcher’s Bag Rests in Peace	50.6%
Broad Sheet	Telegraph	Shaw on Sex, Plum Cake, Morality	60.7%
Broad Sheet	Telegraph	Hollywood Wins Fight	40.2%
Broad Sheet	Telegraph	Cheap Holiday? Take a Sheep	74.1%
Broad Sheet	Telegraph	Playwright Of The Century	48.6%
Broad Sheet	Telegraph	Housing Benefit Fraud unit Closed	51.6%
Broad Sheet	Telegraph	Army’s Boycott of British Lamb	62.1%
Broad Sheet	Telegraph	Pensioners Get Help with the Fuel Bill	52.9%
Broad Sheet	Telegraph	God On a Web Site	39.9%
Broad Sheet	Telegraph	Beef on the Bone Ban is to be lifted	59.0%
Broad Sheet	Times	Serbia Gives Way to Avoid NATO	35.5%
Broad Sheet	Times	Neill wants £20m Cap on Funds	50.0%
Broad Sheet	Times	Foreigners Quit Belgrade	33.8%
Broad Sheet	Times	Leftist May Lead Italy	39.4%
Broad Sheet	Times	Unions Could Join Bosses on Euro	40.8%
Broad Sheet	Times	Tory Plot to Sink Archer Bid for Mayor	50.2%
Broad Sheet	Times	Balkans’ Secretive Sparring Partners	43.3%
Tabloid	US Tabloid	Man Breaks Wind for 30 Years	58.8%
Tabloid	US Tabloid	Contestant Kills Game Show Host	77.0%
Tabloid	US Tabloid	Faldo’s Birdie Tees Off on Porche	66.3%
Tabloid	US Tabloid	Alligator-Man Strikes	71.0%
Tabloid	US Tabloid	Oil Rules Review Vowed	59.8%
Tabloid	US Tabloid	Computer Viruses Infect Humans	55.5%
Tabloid	US Tabloid	One Arm Bandits	69.3%
Tabloid	US Tabloid	Prostate Tumor Removed	71.0%
Tabloid	US Tabloid	Santa Cruz Eats Too Healthy	75.6%
Tabloid	US Tabloid	Satan Joins the Meter Maids	71.0%
Tabloid	US Tabloid	Wolf-Calls from Gay Workers	60.3%
Tabloid	US Tabloid	Bell’s Weird Farewell	58.8%
Tabloid	US Tabloid	Rich Girl: 10 Most Wanted	61.0%
Tabloid	US Tabloid	Monica Talks	77.3%
Tabloid	US Tabloid	Woman Eaten by Parking Lot	72.5%
Tabloid	US Tabloid	Pop Guns	74.8%
Tabloid	US Tabloid	Bulls Gallop Back	60.9%
Tabloid	US Tabloid	Worst Fears Confirmed	58.1%
Tabloid	US Tabloid	Grisly Killing Stuns Friends	62.4%
Tabloid	US Tabloid	Delivery Three Weeks Early	72.2%
Tabloid	US Tabloid	Seinfeld Steals Wife	65.0%
Tabloid	US Tabloid	Dysfunctional Family Eats	50.3%
Tabloid	US Tabloid	Stinky New Hair Gel	64.%
Tabloid	US Tabloid	Naked Prof Calls It Art	47.3%
Tabloid	Washington Post	Bomber and Boss at a Loss	74.3%
Tabloid	Western Morning News	Crowds Are Given Rousing Send Off	55.2%

None of the articles are stored as files as they were only set temporarily to check for grammar and the article titles have been abbreviated. My categorising of tabloids and broad sheets may not be completely correct as newspapers do not always state which type they are. Some newspapers don’t fall into these categories, but I have placed them into two groups for the purpose of analysing with as few variables as possible.

Aim

The purpose of this investigation is to judge the readability of two types of newspapers: broad sheets and tabloids. I will collect at least 50 pieces of raw data for this investigation in order to answer some questions.

Which newspapers are read by a wider audience?
Does the difficulty of the language determine what type of people read the newspaper?

Newspapers with a low readability are aimed at people who can understand longer words and long sentences. Low readability newspaper articles usually contain words with many syllables.

High readability newspaper articles can be understood by a greater range of people with varying intellect. Their sentences are concise and use shorter words with few syllables, making them generally easier to comprehend. Article readability, I feel, is worthy of study as it will show what audience a paper is aimed at, and whether or not broad sheets clearly have a lower readability than tabloid newspapers, which could be aimed at a greater population. It will be interesting to understand how certain people like to read certain types of newspaper.

Method

Newspaper articles will be taken from a search of newspapers on the internet. In each newspaper, the headlines will be copied and analysed with the grammar check in MS Word. Results from the grammar check will be typed into a database, where the most representative readability formula will be chosen as the numeric variable for this investigation. The search engine Yahoo can be instructed to find tabloid and broad sheet newspapers on the internet at random, which makes the selection of newspaper sites random. Yahoo categorises web sites depending on the material contained on them; broad sheets and tabloids are found under two different categories, making the process of finding different types of article easier. An equal number of articles of each type will be used: for this investigation I will use 35 for each type of newspaper, making a total of 70 articles to be analysed for their readability. I will check the readability of the first two articles found on each online newspaper; articles that repeat subject matter already collected will not be included, to ensure a greater variation of data.

The parent population from which I collect the data is undefined. There is no way of finding how many newspaper articles come from either broad sheet or tabloid newspapers. The parent population in this investigation is the set of newspaper articles, from the UK or US, published on the internet.

Processing the data

There was a large population of articles to choose from in the broad sheet category, as many UK national papers have their own web site, where the paper can be read free of charge. Tabloids were scarcer, and I had to use US tabloids in order to have a large enough population to choose from. The articles were copied to MS Word, where they were grammar checked; a large number of the newspaper articles used are included in this investigation.

Data was stored and sorted using a database table in MS Access. A great amount of data was obtained from the grammar check of each article, but the vast majority was unnecessary, as it did not convey readability or the paper type. I have chosen the Flesch readability scale as this incorporates all the points mentioned in the Aim (word and sentence length, and syllables per word). This numeric data is calculated out of 100, but all the data lies in the centre of the possible range, from 33.8 to 77.3. Here is an explanation of each of the readability formulas found after a grammar check:

Flesch Reading Ease

This index computes readability based on the average number of syllables per word and the average number of words per sentence. Scores range from 0 to 100. The average writing score is approximately 60 to 70. The higher the score, the greater the number of people who can readily understand the document.

Flesch-Kincaid Grade Level

This index computes readability based on the average number of syllables per word and the average number of words per sentence. The score in this case indicates a US grade-school level. For example, a score of 8.0 means that an eighth grader would understand the document. Standard writing approximately equates to the seventh-to-eighth-grade level.
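
As a rough illustration of how the two Flesch indices combine their inputs, the sketch below uses the commonly published formulas for Flesch Reading Ease and Flesch-Kincaid Grade Level. Whether Word ’95 applies exactly these constants is an assumption, and the function names and the example counts are hypothetical. Counting the words, sentences and syllables is left to the caller.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Commonly published Flesch Reading Ease formula (assumed, not taken from Word)."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(words, sentences, syllables):
    """Commonly published Flesch-Kincaid Grade Level formula (assumed)."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(round(ease, 1), round(grade, 1))
```

Both functions use the same two averages, which is why longer sentences or words with more syllables lower the reading ease and raise the grade level at the same time.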

Coleman-Liau Grade Level

This index determines a readability grade level based on characters per word and words per sentence.

Bormuth Grade Level

This index also determines a readability grade level based on characters per word and words per sentence.

Reading ease is the measure I want to use in comparing broad sheet and tabloid newspapers, so the Flesch Reading Ease will be used. Here is a table of 35 articles, for each type of newspaper, with their Flesch reading ease from lowest to highest in each category. This table will be used as a source for displays and analysis.

From the table, a frequency table was drawn up, with class sizes of 10, in order to represent the data as a histogram. I have used class sizes of 10 because outliers are more likely to be incorporated in the main distribution of the histogram. A stem and leaf diagram would not be a suitable method of displaying the Flesch reading ease, as there are decimals over a large range and having a leaf for each integer would spread the data out too far for any meaningful analysis. If I were to round the data to the nearest integer and then plot a stem and leaf diagram, data values would be changed, possibly making findings less accurate. A histogram shows the spread of grouped data, and decimals have no effect on the group size.

The frequency density is the frequency divided by the class width (evenly spaced group sizes of 10).

Group	Frequency	x midpoint	Freq. density
30 ≤ Φ < 40	5	35	0.5
40 ≤ Φ < 50	10	45	1
50 ≤ Φ < 60	14	55	1.4
60 ≤ Φ < 70	5	65	0.5
70 ≤ Φ < 80	1	75	0.1

Broad sheet frequency density

Group	Frequency	x midpoint	Freq. density
30 ≤ Φ < 40	0	35	0
40 ≤ Φ < 50	2	45	0.2
50 ≤ Φ < 60	8	55	0.8
60 ≤ Φ < 70	15	65	1.5
70 ≤ Φ < 80	10	75	1

Tabloid frequency density
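
The grouping can be checked with a short script. This is a minimal sketch in Python (not part of the original project, which used MS Access), with the broad sheet scores copied from the Data table; the function name and the 30 to 80 range are my own choices.

```python
from collections import Counter

# Broad sheet Flesch scores copied from the Data table.
BROAD_SHEET = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]
CLASS_WIDTH = 10


def frequency_table(scores, width=CLASS_WIDTH, low=30, high=80):
    """Return (lower bound, frequency, midpoint, frequency density) per class."""
    # Each score is assigned to the class whose lower bound is a multiple of width.
    counts = Counter(int(score // width) * width for score in scores)
    rows = []
    for lower in range(low, high, width):
        freq = counts.get(lower, 0)
        rows.append((lower, freq, lower + width / 2, freq / width))
    return rows


table = frequency_table(BROAD_SHEET)
for lower, freq, midpoint, density in table:
    print(f"{lower} <= score < {lower + CLASS_WIDTH}\t{freq}\t{midpoint}\t{density}")
```

Run on the broad sheet scores, this gives the same frequencies and frequency densities as the broad sheet table above.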

There appears to be a uni-modal distribution in both sets of data: there is only one peak in each, and the frequency falls either side of the modal group. From the frequency tables the following histograms have been produced; the y axis shows the frequency density and the x axis is the Flesch reading ease in groups of 10. Both histograms are identical in size and scale on both axes, and are placed opposite each other for easy comparison.

Figure: Broad sheet Flesch histogram

Figure: Tabloid Flesch histogram

Analysis

The broad sheets histogram is positively skewed while the tabloids histogram is negatively skewed; both are uni-modal. The modal class of the broad sheets is 50 ≤ Φ < 60 and that of the tabloids is 60 ≤ Φ < 70. There are no outliers shown on either histogram. The histograms are quite evidently different, meaning that the two types of papers are aimed at different reading abilities. Tabloids, from looking at the histograms, are generally easier to read, while broad sheets are mainly aimed at people who can understand more complicated words and phrases.

The standard deviation of the article readability for each type of newspaper would be very useful for analysing the average spread of the data and deciding whether broad sheets are more consistently harder to read than tabloid newspapers. Standard deviation shows the average spread of the data from the mean. I have used class sizes of 5 to calculate the standard deviation, as this is more accurate for this purpose than the larger groups of 10 used for the histograms to show the shape of the distribution. The mean of the broad sheets is 50.36 and of the tabloids 64.5.

Broad sheet standard deviation = sqrt(Σfx²/Σf – x̄²)
= sqrt(2619.12 – 2536.13) = 9.11

Tabloid standard deviation = sqrt(Σfx²/Σf – x̄²)
= sqrt(4226.25 – 4160.25) = 8.12
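
The grouped calculation can be reproduced as follows. This is a sketch only: it assumes that each score is replaced by the midpoint of its class of width 5 (32.5, 37.5, and so on) and that the standard deviation is the square root of the mean of the squares minus the square of the mean. The function name is hypothetical and the scores are the broad sheet values from the Data table.

```python
import math
from collections import Counter

# Broad sheet Flesch scores copied from the Data table.
BROAD_SHEET = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]


def grouped_mean_and_sd(scores, width=5):
    """Estimate the mean and standard deviation from class midpoints."""
    # Count how many scores fall in each class, keyed by the class midpoint.
    midpoints = Counter(int(s // width) * width + width / 2 for s in scores)
    n = sum(midpoints.values())
    mean = sum(f * x for x, f in midpoints.items()) / n
    mean_of_squares = sum(f * x * x for x, f in midpoints.items()) / n
    return mean, math.sqrt(mean_of_squares - mean ** 2)


mean, sd = grouped_mean_and_sd(BROAD_SHEET)
print(round(mean, 2), round(sd, 2))
```

For the broad sheets this gives a mean that rounds to 50.36 and a standard deviation of roughly 9.1. Any small difference in the second decimal place from the figure above may come from rounding the mean before squaring it.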

The standard deviations for the two types of newspaper are quite similar, but the average spread of the broad sheets’ readability is greater than the tabloids’. Therefore, the tabloids are more consistently easy to read, while the broad sheets are slightly less consistent in being hard to read.

As both sets of data are skewed (the broad sheets positively and the tabloids negatively), it would be appropriate to find the median and interquartile range. The median is a good representative measure of the data as it is not affected by outliers (in this case uncommon values of readability ease); I am looking for the typical value for each newspaper type. The process of calculating the median for the readability ease of broad sheets and tabloids is shown below.

Broad sheet median = ½(n + 1)th data value
½ × 35 + ½ = 18th data value

Tabloid median = ½(n + 1)th data value
½ × 35 + ½ = 18th data value
When n is odd; data sorted in ascending order.

Number	Article name	Flesch
18	Thatcher’s Bag Rests in Peace	50.6%

Broad sheet median

Number	Article name	Flesch
18	Intimidation and Violence in Penwith	64.3%

Tabloid median
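
The position rule for the median can be checked with a few lines of Python. This is an illustrative sketch using the broad sheet scores from the Data table; the function name is my own, and it deliberately handles only an odd number of values, as in this investigation.

```python
# Broad sheet Flesch scores copied from the Data table.
BROAD_SHEET = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]


def median_odd(scores):
    """Return the 1-based position and value of the median for odd n."""
    ordered = sorted(scores)
    n = len(ordered)
    if n % 2 == 0:
        raise ValueError("this sketch only handles an odd number of values")
    position = (n + 1) // 2  # half of (n + 1): 18 when n = 35
    return position, ordered[position - 1]


position, median = median_odd(BROAD_SHEET)
print(position, median)
```

For the 35 broad sheet scores this picks the 18th value in ascending order, 50.6, matching the broad sheet median table.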

The interquartile range is also useful: in this instance it represents only the data found in the central half of the whole range, and shows the difference between a representative low and high value. The process of calculating the interquartile range for the readability ease of broad sheets and tabloids is shown below.

Broad sheet interquartile range = Q₃ – Q₁
(¾n + ½) – (¼n + ½)
(¾ × 35 + ½) – (¼ × 35 + ½)
27th – 9th data value

Tabloid interquartile range = Q₃ – Q₁
(¾n + ½) – (¼n + ½)
(¾ × 35 + ½) – (¼ × 35 + ½)
27th – 9th data value
Data sorted in ascending order.

Number	Article name	Flesch
9	Unions Could Join Bosses on Euro	40.8%
27	Doctors Leaving NHS	56.3%

Broad sheet interquartile range

Number	Article name	Flesch
9	Bell’s Weird Farewell	58.8%
27	Prostate Tumor Removed	71.0%

Tabloid interquartile range
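
The quartile positions can be sketched in the same way. The code below assumes that the positions ¼n + ½ and ¾n + ½ are rounded to the nearest whole data value, which reproduces the 9th and 27th values used here; other textbooks use slightly different quartile rules. The function names are hypothetical and the scores are the broad sheet values from the Data table.

```python
# Broad sheet Flesch scores copied from the Data table.
BROAD_SHEET = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]


def quartile_positions(n):
    """1-based positions of Q1 and Q3, rounded to the nearest whole value."""
    q1_position = round(n / 4 + 0.5)      # 9.25 rounds to 9 when n = 35
    q3_position = round(3 * n / 4 + 0.5)  # 26.75 rounds to 27 when n = 35
    return q1_position, q3_position


def interquartile_range(scores):
    """Return Q1, Q3 and their difference for the sorted scores."""
    ordered = sorted(scores)
    q1_position, q3_position = quartile_positions(len(ordered))
    q1 = ordered[q1_position - 1]
    q3 = ordered[q3_position - 1]
    return q1, q3, round(q3 - q1, 1)


q1, q3, iqr = interquartile_range(BROAD_SHEET)
print(q1, q3, iqr)
```

For the broad sheets this selects 40.8 and 56.3, an interquartile range of 15.5. Note that the rounding step is ambiguous when a position falls exactly halfway between two values, which does not happen for n = 35.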

The interquartile range shows similar results to the standard deviation. The range between the typical high and low values is greater for the broad sheets (56.3 – 40.8 = 15.5) than for the tabloids (71.0 – 58.8 = 12.2), showing that the tabloid data is more densely packed near the median. The lower readability of broad sheets is less consistent than the higher readability of the tabloids, as shown earlier by the standard deviation.

Interpretation

There is a very clear difference between broad sheet and tabloid newspapers. The broad sheets are much more difficult to read, as longer sentences are used in the articles. Tabloid articles use less complicated language, resulting in a consistently higher Flesch readability ease; this is shown well by the two side-by-side histograms of Flesch readability ease. I anticipated this result as I have read many articles from both types of newspapers. From my own experience I have found tabloids quite easy to understand, as the facts and views in the article are put forward simply, using short words and sentences. Broad sheet newspaper articles I usually find more difficult to read, but there is often much more information packed into one sentence.

The standard deviation is an accurate measurement of which type of newspaper is aimed at a wider audience; there is a similar result for the broad sheets (9.11) and the tabloids (8.12). Broad sheets are aimed at a wider audience, as they have a greater average spread from the mean. Looking at the real-life application of the standard deviation, one must consider the percentages of the population which would have little trouble in reading lower readability newspapers. I do not have these figures, so I cannot conclude which type of paper is read by a greater population. I would assume tabloids, as they can be easily understood by readers who understand only shorter words and less complicated sentences, as well as by competent readers. Broad sheets, which are usually harder to understand, will normally be read only by readers who understand longer words and sentences.

The readability of the newspaper does determine who reads the paper. Broad sheets are probably read by people who can easily understand low readability articles, while tabloids are often read by people who find low readability articles difficult and prefer the high readability articles found in tabloid newspapers. I believe this data was worth collecting, even if it took a long time to obtain from the internet, grammar check and sort in a database. It is essential in showing how broad sheet and tabloid newspapers have different readability levels. I believe that this data is a good representation of the parent population of newspapers in the UK and US, as I have obtained Flesch readability ease data from a total of 70 individual newspaper articles. 70 pieces of data should be enough to show patterns from which an accurate conclusion can be deduced.

Evaluation

Looking at the table of results I found one outlier article in the broad sheets’ readability: “Cheap Holiday? Take a Sheep”. It scored almost 10 more on the Flesch scale than the previous article (sorted in ascending order). This outlier was hidden when the data was grouped into class sizes of 10. On reading the article I discovered that it was written as a humorous, tabloid-style article, as denoted by the heading. As I have a reasonable amount of data, the effect of this outlier on the sample is reduced. There was no reason to remove the data; it is valid as it was found in a broad sheet newspaper.

There would have been a bias in the data due to the material available on the internet. I found that collecting broad sheet newspaper articles was not too difficult, as there were many broad sheet papers in the category in Yahoo! (internet search engine). Finding tabloid articles was very difficult, as there are very few in the UK and only a small list available in the US. Broad sheet newspapers are very up-to-date with internet technology; almost every broad sheet sold in the UK has its own web site. Tabloids are less keen to build web sites and to update them continually as their newspapers are published. Besides The Mirror and local tabloids, there was a very short list for Yahoo! to choose from when making random searches, resulting in repetitions of search results. When I included US tabloids there were just enough tabloids available to retrieve a varied sample of articles from different newspapers. I must consider the effect of the US newspaper articles included in this investigation, as US English is different from UK English and articles are written in a different style. Fortunately, Flesch readability ease is unbiased by language irregularities, as it only takes into consideration syllables per word and the average length of sentences, which are almost unaffected by word order, spelling etc. If I wanted to extend this work I would manually copy and grammar check more newspaper articles, trying to obtain an equal number of US newspapers for each newspaper type, eliminating the possibility of a US/UK English bias in the Flesch reading ease results. The method of finding articles on the internet is also biased, as some papers will be excluded from the list because they have no web site. The actual selection of articles to include from a newspaper once the web site is visited is quite inaccurate.

The method used for data collection in this investigation has been to take the first 2 articles whose subject matter hasn’t been obtained elsewhere previously. An example of a random, almost unbiased method of data collection would be:

Do not use the internet, which gives only a limited set of data
Select all newspapers on a certain date
Make a numbered list of all articles in the newspapers
Use the random function on a calculator to select an equal sample of articles for each type of newspaper
Use only UK national newspapers
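
The random selection step in this list could be sketched as below, with Python's random module standing in for the calculator's random function. The list sizes (400 broad sheet and 250 tabloid articles), the seed and the function name are all hypothetical, chosen only to make the example repeatable.

```python
import random


def sample_articles(numbered_articles, sample_size, seed=None):
    """Pick sample_size distinct article numbers at random, without repeats."""
    rng = random.Random(seed)
    return sorted(rng.sample(numbered_articles, sample_size))


# Hypothetical numbered lists of every article published on the chosen date.
broad_sheet_numbers = list(range(1, 401))
tabloid_numbers = list(range(1, 251))

# An equal sample of 35 articles for each type of newspaper.
broad_sheet_sample = sample_articles(broad_sheet_numbers, 35, seed=1999)
tabloid_sample = sample_articles(tabloid_numbers, 35, seed=1999)
print(broad_sheet_sample)
print(tabloid_sample)
```

Because every numbered article has the same chance of being chosen, this avoids the bias of always taking the first two articles on a web site.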

This set up would involve a massive amount of tedious listing of all the broad sheet and tabloid newspapers in the UK on a certain date, and then copying of the randomly chosen articles for grammar checking. The parent population in this investigation would be of all the broad sheet and tabloid newspapers in the UK.

Comments

All comments are welcome, on this 1999 maths project. I created it in the early days of the web when few newspapers had a proper online presence.