Mathematics: Newspaper Readability
This is an investigation into the readability of articles in certain types of newspaper: broad sheet, tabloid and local newspapers. I use the readability scales reported by the grammar check in Microsoft Word.
I have used over 60 articles from newspapers in the US and UK. The Flesch scale is given because it is a sound measure of readability. The lower the percentage, the more experienced the reader needs to be to understand the article; a high percentage means that basic, short words are used frequently, so the article can be understood by a large number of people. Tabloids are noticeably easier to read, and broad sheets are aimed at people with higher language skills.
This is taken from the help in Microsoft Word on their grammar checking formulae:
The Flesch Index computes readability based on the average number of syllables per word and the average number of words per sentence. Scores range from 0% to 100%. The average writing score is approximately 60% to 70%. The higher the score, the greater the number of people who can readily understand the document. (Help in Microsoft Word ‘95)
|Newspaper type||Name of newspaper||Article name||Flesch|
|Broad Sheet||BBC News||Hong Kong: UK – Dull But Tradeworthy||54.4%|
|Broad Sheet||BBC News||Pope Makes Jewish-Born Nun Saint||47.6%|
|Tabloid||Chicago Tribune||Stopgap Action Prevents Shutdown||43.9%|
|Tabloid||Cornishman||Intimidation and Violence in Penwith||64.3%|
|Tabloid||Cornishman||Mother Slams Drug Dealers||61.1%|
|Broad Sheet||Guardian||Twins Separated by No-Man’s Land||60.8%|
|Broad Sheet||Guardian||Bonds Plunge Fuels Turmoil||55.2%|
|Broad Sheet||Guardian||Pinochet Arrested In London||40.6%|
|Broad Sheet||Herald||NATO waits For Milosevic||36.9%|
|Broad Sheet||Herald||New Sunday Newspaper for Scotland||51.7%|
|Broad Sheet||Independent||Genetic Crops May Be Banned||44.6%|
|Broad Sheet||Independent||Doctors Leaving NHS||56.3%|
|Broad Sheet||Independent||Keep the Red Flag Flying||56.7%|
|Broad Sheet||Independent||Arts May Reap Millions||52.5%|
|Broad Sheet||Independent||Alcopop on Sale Next to Sweets||40.6%|
|Broad Sheet||Independent||Riots Rage in Hebron||64.8%|
|Broad Sheet||Independent on Sunday||China Selects The Blair||44.1%|
|Broad Sheet||London Evening Standard||Maxwell Faces Court Action||54.1%|
|Tabloid||Mirror||Fight to The Bitter End||55.1%|
|Tabloid||Mirror||Palace Must Match Our Good Sense||66.3%|
|Tabloid||Mirror||£33m Fine for The Sugar Price Fixers||65.2%|
|Tabloid||Mirror||Shearer For £18m||64.2%|
|Tabloid||Mirror on Sunday||Clare’s Clanger||69.2%|
|Tabloid||Mirror on Sunday||Secret Cancer Battle||68.4%|
|Broad Sheet||New York Post||Starr May Be Called to Testify||45.8%|
|Broad Sheet||Observer||Boosting Ranks of Black Police||54.8%|
|Broad Sheet||Telegraph||£0.5m Damages for Doctor||63.9%|
|Broad Sheet||Telegraph||Thatcher’s Bag Rests in Peace||50.6%|
|Broad Sheet||Telegraph||Shaw on Sex, Plum Cake, Morality||60.7%|
|Broad Sheet||Telegraph||Hollywood Wins Fight||40.2%|
|Broad Sheet||Telegraph||Cheap Holiday? Take a Sheep||74.1%|
|Broad Sheet||Telegraph||Playwright Of The Century||48.6%|
|Broad Sheet||Telegraph||Housing Benefit Fraud unit Closed||51.6%|
|Broad Sheet||Telegraph||Army’s Boycott of British Lamb||62.1%|
|Broad Sheet||Telegraph||Pensioners Get Help with the Fuel Bill||52.9%|
|Broad Sheet||Telegraph||God On a Web Site||39.9%|
|Broad Sheet||Telegraph||Beef on the Bone Ban is to be lifted||59.0%|
|Broad Sheet||Times||Serbia Gives Way to Avoid NATO||35.5%|
|Broad Sheet||Times||Neill wants £20m Cap on Funds||50.0%|
|Broad Sheet||Times||Foreigners Quit Belgrade||33.8%|
|Broad Sheet||Times||Leftist May Lead Italy||39.4%|
|Broad Sheet||Times||Unions Could Join Bosses on Euro||40.8%|
|Broad Sheet||Times||Tory Plot to Sink Archer Bid for Mayor||50.2%|
|Broad Sheet||Times||Balkans’ Secretive Sparring Partners||43.3%|
|Tabloid||US Tabloid||Man Breaks Wind for 30 Years||58.8%|
|Tabloid||US Tabloid||Contestant Kills Game Show Host||77.0%|
|Tabloid||US Tabloid||Faldo’s Birdie Tees Off on Porche||66.3%|
|Tabloid||US Tabloid||Alligator-Man Strikes||71.0%|
|Tabloid||US Tabloid||Oil Rules Review Vowed||59.8%|
|Tabloid||US Tabloid||Computer Viruses Infect Humans||55.5%|
|Tabloid||US Tabloid||One Arm Bandits||69.3%|
|Tabloid||US Tabloid||Prostate Tumor Removed||71.0%|
|Tabloid||US Tabloid||Santa Cruz Eats Too Healthy||75.6%|
|Tabloid||US Tabloid||Satan Joins the Meter Maids||71.0%|
|Tabloid||US Tabloid||Wolf-Calls from Gay Workers||60.3%|
|Tabloid||US Tabloid||Bell’s Weird Farewell||58.8%|
|Tabloid||US Tabloid||Rich Girl: 10 Most Wanted||61.0%|
|Tabloid||US Tabloid||Monica Talks||77.3%|
|Tabloid||US Tabloid||Woman Eaten by Parking Lot||72.5%|
|Tabloid||US Tabloid||Pop Guns||74.8%|
|Tabloid||US Tabloid||Bulls Gallop Back||60.9%|
|Tabloid||US Tabloid||Worst Fears Confirmed||58.1%|
|Tabloid||US Tabloid||Grisly Killing Stuns Friends||62.4%|
|Tabloid||US Tabloid||Delivery Three Weeks Early||72.2%|
|Tabloid||US Tabloid||Seinfeld Steals Wife||65.0%|
|Tabloid||US Tabloid||Dysfunctional Family Eats||50.3%|
|Tabloid||US Tabloid||Stinky New Hair Gel||64.%|
|Tabloid||US Tabloid||Naked Prof Calls It Art||47.3%|
|Tabloid||Washington Post||Bomber and Boss at a Loss||74.3%|
|Tabloid||Western Morning News||Crowds Are Given Rousing Send Off||55.2%|
None of the articles are stored as files, as they were only held temporarily for the grammar check, and the article titles have been abbreviated. My categorising of tabloids and broad sheets may not be completely correct, as newspapers do not always state which type they are. Some newspapers don’t fall into either category, but I have placed them all into two groups so that the analysis has as few variables as possible.
The purpose of this investigation is to judge the readability of two types of newspapers: broad sheets and tabloids. I will collect at least 50 pieces of raw data for this investigation in order to answer some questions.
- Which newspapers are read by a wider audience?
- Does the difficulty of the language determine what type of people read the newspaper?
Newspapers with a low readability are aimed at people who can understand longer words and long sentences. Low readability newspaper articles usually contain words with many syllables.
High readability newspaper articles can be understood by a greater range of people with varying intellect. Their sentences are concise and contain shorter words with few syllables, making them generally easier to comprehend. Article readability, I feel, is worthy of study as it will indicate what audience the paper is aimed at, and whether or not broad sheets clearly have a lower readability than tabloid newspapers, which could be aimed at a greater population. It will be interesting to understand how certain people like to read certain types of newspaper.
Newspaper articles will be taken from a search of newspapers on the internet. In each newspaper, the headlines will be copied and analysed with the grammar check in MS Word. Results from the grammar check will be typed into a database, where the most representative readability formula will be chosen as the numeric variable for this investigation. The search engine Yahoo can be instructed to randomly find tabloid and broad sheet newspapers on the internet. This made the newspaper site selection random. Yahoo categorises web sites depending on the material contained on them; broad sheets and tabloids are found under two different categories, making the process of finding different types of article easier. An equal number of articles of each type will be used. For this investigation I will use 35 for each type of newspaper, making a total of 70 articles to be analysed for their readability. I will take the first two articles found on each online newspaper for checking of readability; repeated article subject matter will not be included, to ensure a greater variation of data.
The parent population from which I may choose to collect the data is undefined. There is no way of finding how many newspaper articles come from either broad sheet or tabloid newspapers. The set of newspaper articles from the UK or US that are published on the internet is the parent population in this investigation.
Processing the data
There was a large population from which to find articles for the broad sheets category, as many UK national papers have their own web site, where the paper can be read free of charge. Tabloids were more scarce and I had to use US tabloids in order to have a large enough population to choose from. The articles were copied to MS Word, where they were grammar checked. A large number of the newspaper articles used are included in this investigation.
Data was stored and sorted using a database table in MS Access. There was a great amount of data obtained from the grammar check of each article, but the vast majority was unnecessary, as it did not convey readability or the paper type. I have chosen the Flesch readability scale as this incorporates all the points mentioned in the Aim (word length, sentence length and syllables per word). This numeric data is calculated out of 100, but all the data lies in the centre of the possible range, from 33.8 to 77.3. Here is an explanation of each of the readability formulas found after a grammar check:
Flesch Reading Ease
This index computes readability based on the average number of syllables per word and the average number of words per sentence. Scores range from 0 to 100. The average writing score is approximately 60 to 70. The higher the score, the greater the number of people who can readily understand the document.
Flesch-Kincaid Grade Level
This index computes readability based on the average number of syllables per word and the average number of words per sentence. The score in this case indicates a US grade-school level. For example, a score of 8.0 means that an eighth grader would understand the document. Standard writing approximately equates to the seventh-to-eighth-grade level.
Coleman-Liau Grade Level
This index determines a readability grade level based on characters per word and words per sentences.
Bormuth Grade Level
This index also determines a readability grade level based on characters per word and words per sentences.
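As a rough illustration of the first two indexes described above, here is a minimal Python sketch using the standard published constants for Flesch Reading Ease and Flesch-Kincaid Grade Level. Whether Word ’95 used exactly these constants is an assumption, and the word, sentence and syllable counts below are made up purely for the example.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts (standard published constants)."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from the same three counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for two invented 100-word passages (not from the data).
short_style = flesch_reading_ease(words=100, sentences=8, syllables=130)
long_style = flesch_reading_ease(words=100, sentences=4, syllables=160)

# Shorter sentences and fewer syllables per word give the higher score.
print(round(short_style, 1), round(long_style, 1))
```

Both functions take counts rather than raw text, because counting syllables automatically needs its own (approximate) rules.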
The reading ease of an article is the formula I want to use in comparing broad sheet and tabloid newspapers, so the Flesch Reading Ease will be used. Here is a table of 35 articles, for each type of newspaper, with their Flesch reading ease from lowest to highest in each category. This table will be used as a source for displays and analysis.
From the table, a frequency table was drawn up, with class sizes of 10, in order to represent the data as a histogram. I have used class sizes of 10 because outliers are then more likely to be incorporated in the main distribution of the histogram. A stem and leaf diagram would not be a suitable method of displaying the Flesch reading ease data, as there are decimals over a large range, and having a leaf for each integer would spread the data out too far for any meaningful analysis. If I were to round the data to the nearest integer and then plot a stem and leaf diagram, data values would be changed, possibly making findings less accurate. A histogram shows the spread of grouped data, and decimals have no effect on the group size.
The frequency density is the frequency divided by the class width (evenly spaced group sizes of 10).
Broad sheets:
|Group||Frequency||x midpoint||Freq. density|
|30 ≤ Φ < 40||5||35||0.5|
|40 ≤ Φ < 50||10||45||1|
|50 ≤ Φ < 60||14||55||1.4|
|60 ≤ Φ < 70||5||65||0.5|
|70 ≤ Φ < 80||1||75||0.1|
Tabloids:
|Group||Frequency||x midpoint||Freq. density|
|30 ≤ Φ < 40||0||35||0|
|40 ≤ Φ < 50||2||45||0.2|
|50 ≤ Φ < 60||8||55||0.8|
|60 ≤ Φ < 70||15||65||1.5|
|70 ≤ Φ < 80||10||75||1|
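The grouping into classes of 10 can be checked with a short Python sketch. The scores are the broad sheet Flesch values copied from the article table; the function name and layout are illustrative only. The tabloid list is left out because one tabloid score in the table is incomplete.

```python
from collections import Counter

# Broad sheet Flesch scores, copied from the article table.
broad_sheet_scores = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]


def frequency_table(scores, width=10):
    """Return (lower, upper, frequency, midpoint, frequency density) per class."""
    # Each score falls in the class whose lower bound is a multiple of `width`.
    counts = Counter(int(score // width) * width for score in scores)
    rows = []
    for lower in range(min(counts), max(counts) + width, width):
        freq = counts.get(lower, 0)
        rows.append((lower, lower + width, freq, lower + width / 2, freq / width))
    return rows


for lower, upper, freq, midpoint, density in frequency_table(broad_sheet_scores):
    print(f"{lower} <= score < {upper}: freq {freq}, midpoint {midpoint}, density {density}")
```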
There appears to be a uni-modal distribution in both sets of data: there is only one peak in each, and the frequency falls either side of the modal group. From the frequency tables the following histograms have been produced. The y axis shows the frequency density and the x axis is the Flesch reading ease in groups of 10. Both histograms are identical in size and scale on both axes, and are opposite each other for easy comparison.
The broad sheets histogram is positively skewed while the tabloids histogram is negatively skewed, and both are uni-modal. The modal class of the broad sheets is 50 ≤ Φ < 60 and that of the tabloids is 60 ≤ Φ < 70. There are no outliers shown on either histogram. The histograms are quite evidently different, suggesting that the two types of paper are aimed at different reading abilities. Tabloids, from looking at the histograms, are generally easier to read, while broad sheets are mainly aimed at people who can understand more complicated words and phrases.
The standard deviation of the article readability for each type of newspaper is useful for analysing the spread of the data and deciding whether broad sheets are more consistently hard to read than tabloid newspapers are easy to read. Standard deviation shows the average spread of the data from the mean. I have used class sizes of 5 to calculate the standard deviation, as this is more accurate for this purpose than the larger groups of 10 used for the histograms to show the shape of the distribution. The mean of broad sheets = 50.36 and tabloids = 64.5.
Broad sheet standard deviation = 9.11
Tabloid standard deviation = 8.12
The standard deviations for the two types of newspaper are quite similar, but the spread of the broad sheets’ readability is greater than the tabloids’. Therefore, the tabloids are more consistently easy to read, while the broad sheets are slightly less consistent in being hard to read.
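The grouped calculation can be sketched in Python as follows, for the broad sheets only. It assumes each score is replaced by the midpoint of its class of width 5 and that the population form of the standard deviation (dividing by n) was used; neither detail is stated explicitly in the working, so the standard deviation it prints may differ slightly from the figure I quote.

```python
import math
from collections import Counter

# Broad sheet Flesch scores, copied from the article table.
broad_sheet_scores = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]


def grouped_mean_and_sd(scores, width=5):
    """Mean and standard deviation estimated from class midpoints."""
    # Map every score to the midpoint of its class, then count per midpoint.
    counts = Counter(int(s // width) * width + width / 2 for s in scores)
    n = sum(counts.values())
    mean = sum(mid * freq for mid, freq in counts.items()) / n
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum(freq * (mid - mean) ** 2 for mid, freq in counts.items()) / n
    return mean, math.sqrt(variance)


mean, sd = grouped_mean_and_sd(broad_sheet_scores)
print(round(mean, 2), round(sd, 2))
```

With class midpoints the mean comes out at the 50.36 quoted for broad sheets, which suggests the grouping assumption is reasonable.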
As both sets of data are skewed (the broad sheets positively and the tabloids negatively), it is appropriate to find the median and interquartile range. The median is a good representative value because it is not affected by outliers (in this case, uncommon values of readability ease), and I am looking for the typical value for each newspaper type. The process of calculating the median for the readability ease of broad sheets and tabloids is shown below.
Broad sheet median = ½(n + 1)th data value
½ × 35 + ½ = 18th data value
Tabloid median = ½(n + 1)th data value
½ × 35 + ½ = 18th data value
This applies when n is odd, with the data sorted in ascending order.
|Newspaper type||Position||Article name||Flesch|
|Broad Sheet||18||Thatcher’s Bag Rests in Peace||50.6%|
|Tabloid||18||Intimidation and Violence in Penwith||64.3%|
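The median lookup can be sketched in Python for the broad sheets. With n = 35 the median is the ½(n + 1) = 18th value of the sorted scores. The function name is illustrative, and the tabloid list is omitted because one tabloid score in the table is incomplete.

```python
# Broad sheet Flesch scores, copied from the article table.
broad_sheet_scores = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]


def median_odd(scores):
    """Return (position, value) of the median for an odd number of scores."""
    ordered = sorted(scores)
    n = len(ordered)
    if n % 2 == 0:
        raise ValueError("this sketch only handles an odd number of values")
    position = (n + 1) // 2          # 1-based position, e.g. 18 when n = 35
    return position, ordered[position - 1]


position, median = median_odd(broad_sheet_scores)
print(position, median)
```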
The interquartile range is also useful. In this instance it covers only the central half of the data, and shows the difference between a representative low value and a representative high value. The process of calculating the interquartile range for the readability ease of broad sheets and tabloids is shown below.
Broad sheet interquartile range = Q3 – Q1
(¾n + ½) – (¼n + ½)
(¾ × 35 + ½) – (¼ × 35 + ½)
27th – 9th data value
Tabloid interquartile range = Q3 – Q1
(¾n + ½) – (¼n + ½)
(¾ × 35 + ½) – (¼ × 35 + ½)
27th – 9th data value
Data sorted in ascending order.
|Newspaper type||Position||Article name||Flesch|
|Broad Sheet||9||Unions Could Join Bosses on Euro||40.8%|
|Broad Sheet||27||Doctors Leaving NHS||56.3%|
|Tabloid||9||Bell’s Weird Farewell||58.8%|
|Tabloid||27||Prostate Tumor Removed||71.0%|
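The quartile positions and the interquartile range can be sketched in Python for the broad sheets. This assumes the positions ¼n + ½ and ¾n + ½ are rounded to the nearest whole position (9.25 to the 9th value, 26.75 to the 27th), which matches the rows in the table; other textbooks define quartiles slightly differently.

```python
# Broad sheet Flesch scores, copied from the article table.
broad_sheet_scores = [
    54.4, 47.6, 60.8, 55.2, 40.6, 36.9, 51.7, 44.6, 56.3, 56.7, 52.5, 40.6,
    64.8, 44.1, 54.1, 45.8, 54.8, 63.9, 50.6, 60.7, 40.2, 74.1, 48.6, 51.6,
    62.1, 52.9, 39.9, 59.0, 35.5, 50.0, 33.8, 39.4, 40.8, 50.2, 43.3,
]


def quartile_positions(n):
    """1-based positions of Q1 and Q3, rounded to the nearest data value."""
    # Assumption: round the fractional position rather than interpolate.
    return round(n / 4 + 0.5), round(3 * n / 4 + 0.5)


def interquartile_range(scores):
    """Return (Q1, Q3, Q3 - Q1) using the rounded positions above."""
    ordered = sorted(scores)
    q1_pos, q3_pos = quartile_positions(len(ordered))
    q1, q3 = ordered[q1_pos - 1], ordered[q3_pos - 1]
    return q1, q3, q3 - q1


q1, q3, iqr = interquartile_range(broad_sheet_scores)
print(q1, q3, round(iqr, 1))
```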
The interquartile range shows similar results to the standard deviation. The range between the typical high and low values is greater for the broad sheets (56.3 – 40.8 = 15.5) than for the tabloids (71.0 – 58.8 = 12.2). The lower interquartile range of the tabloids shows that their data is more densely packed near the median. The lower readability of broad sheets is less consistent than the higher readability of the tabloids, as shown earlier by the standard deviation.
There is a very clear difference between broad sheet and tabloid newspapers. Broad sheet articles are much more difficult to read, as longer sentences are used. Tabloid articles use less complicated language, resulting in a consistently higher Flesch readability ease; this is shown well by the two side-by-side histograms of Flesch readability ease. I anticipated this result as I have read many articles from both types of newspaper. From my own experience I have found tabloids quite easy to understand, as the facts and views in the article are put forward simply, using short words and sentences. Broad sheet newspaper articles I usually find more difficult to read, but there is often much more information packed into one sentence.
The standard deviation can be used as a measure of which type of newspaper is aimed at a wider audience; the results are similar for the broad sheets (9.11) and the tabloids (8.12). On this measure, broad sheets are aimed at a wider audience, as they have a greater average spread from the mean. Looking at the real-life application of the standard deviation, one must consider the percentage of the population that would have little trouble in reading lower readability newspapers. I do not have these figures, so I cannot conclude which type of paper is read by a greater population. I would assume tabloids, as they can be easily understood both by readers who only understand shorter words and less complicated sentences and by competent readers. Broad sheets, which are usually harder to understand, will normally be read only by readers who understand longer words and sentences.
The readability of the newspaper does appear to determine who reads the paper. Broad sheets are probably read by people who can easily understand low readability articles, and tabloids are often read by people who find low readability articles difficult and so read the high readability articles found in tabloid newspapers. I believe this data was worth collecting, even if it took a long time to obtain from the internet, grammar check and sort in a database. It is essential in showing how broad sheet and tabloid newspapers have different readability levels. I believe that this data is a good representation of the parent population of newspapers in the UK and US, as I have obtained Flesch readability ease data from a total of 70 individual newspaper articles. 70 pieces of data should be enough to reveal patterns from which a reasonable conclusion can be drawn.
Looking at the table of results, I found one outlier article in the broad sheets’ readability: “Cheap Holiday? Take a Sheep”. It scored almost 10 more on the Flesch scale than the previous article (sorted in ascending order). This outlier was hidden when the data was grouped into class sizes of 10. On reading the article I discovered that it was written as a humorous, tabloid-style article, as suggested by the heading. As I have a reasonable amount of data, the effect of this outlier on the sample is reduced. There was no reason to remove the data; it is valid as it was found in a broad sheet newspaper.
There would have been a bias in the data due to the material available on the internet. I found that collecting broad sheet newspaper articles was not too difficult, as there were many broad sheet papers in the category in Yahoo! (internet search engine). Finding tabloid articles was very difficult, as there are very few in the UK and only a small list available in the US. Broad sheet newspapers are very up-to-date with internet technology; almost every broad sheet sold in the UK has its own web site. Tabloids are less keen to build web sites and continually update them when their newspapers are published. Besides The Mirror and local tabloids, there was a very short list for Yahoo! to choose from when making random searches, resulting in repetitions of search results. When I included US tabloids there were just enough tabloids available to retrieve a varied sample of articles from different newspapers.
I must consider the effect of the US newspaper articles included in this investigation, as US English is different from UK English and articles are written in a different style. Fortunately, Flesch readability ease is largely unaffected by such differences, as it only takes into consideration syllables per word and the average length of sentences, which are almost unaffected by word order, spelling etc. If I wanted to extend this work I would manually copy and grammar check more newspaper articles, trying to obtain an equal number of US newspapers for each newspaper type, eliminating the possibility of a US/UK English bias in the Flesch reading ease results.
The method of finding articles on the internet is also biased, as some papers will be excluded from the list because they have no web site. The actual selection of articles to include from a newspaper, once the web site is visited, is also not truly random.
The method used for data collection for this investigation has been to take the first 2 articles whose subject matter hasn’t already been obtained elsewhere. An example of a random, almost unbiased method of data collection would be:
- Do not use the internet to collect a limited set of data
- Select all newspapers on a certain date
- Make a numbered list of all articles in the newspapers
- Use the random function on a calculator to select an equal sample of articles for each type of newspaper
- Use only UK national newspapers
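The random selection step in the list above could be sketched in Python instead of a calculator. The numbered lists and their sizes (200 broad sheet and 150 tabloid articles) are hypothetical, as is the function name; only the sample size of 35 per type comes from this investigation.

```python
import random


def sample_articles(numbered_articles, sample_size, seed=None):
    """Pick `sample_size` distinct article numbers at random, without repeats."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(numbered_articles, sample_size))


# Hypothetical numbered lists of every article published on the chosen date.
broad_sheet_articles = list(range(1, 201))
tabloid_articles = list(range(1, 151))

# Equal samples of 35 articles for each type of newspaper.
broad_sheet_sample = sample_articles(broad_sheet_articles, 35, seed=1999)
tabloid_sample = sample_articles(tabloid_articles, 35, seed=1999)

print(broad_sheet_sample)
print(tabloid_sample)
```

Sampling without replacement means no article can be chosen twice, which replaces the manual check for repeated articles.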
This setup would involve a massive amount of tedious listing of all the articles in the broad sheet and tabloid newspapers in the UK on a certain date, and then copying of the randomly chosen articles for grammar checking. The parent population in such an investigation would be all the broad sheet and tabloid newspapers in the UK.
All comments are welcome, on this 1999 maths project. I created it in the early days of the web when few newspapers had a proper online presence.