How pandas analyze data

Using a (slightly modified) borrowed example, we will show how to analyze data with Pandas together with Matplotlib and ggplot. Pandas is largely inspired by R, specifically by its data.frame class, which Pandas calls DataFrame. It is a 2D tabular structure, similar to a table in a relational (SQL) database or an Excel sheet. Pandas uses NumPy for most of its computations, which makes them very fast, and at the same time it allows very flexible manipulation of the data.

A Rubric for Data Wrangling and Exploration

Companion to Lecture 4 of Harvard CS109: Data Science | Prepared by Chris Beaumont

This scene from Cast Away is an accurate metaphor for the amount of time you'll spend cleaning data, and the delirium you'll experience at the end.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

#tell pandas to display wide tables as pretty HTML tables
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

I'd like to suggest a basic rubric for the early stages of exploratory data analysis in Python. This isn't universally applicable, but it does cover many patterns which recur in several data analysis contexts. It's useful to keep this rubric in mind when encountering a new dataset.

The basic workflow is as follows:

  1. Build a DataFrame from the data (ideally, put all data in this object)
  2. Clean the DataFrame. It should have the following properties:
    • Each row describes a single object
    • Each column describes a property of that object
    • Columns are numeric whenever appropriate
    • Columns contain atomic properties that cannot be further decomposed
  3. Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
  4. Explore group properties. Use groupby and small multiples to compare subsets of the data.

This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up in subsequent analysis.

Here's a preview of the raw data we'll use -- it's a list of the 10,000 movies made since 1950 with the most IMDB user ratings. It was scraped about a year ago from pages like this. Download the data at http://bit.ly/cs109_imdb.

In [2]:
!head imdb_top_10000.txt
tt0111161	The Shawshank Redemption (1994)	1994	 9.2	619479	142 mins.	Crime|Drama
tt0110912	Pulp Fiction (1994)	1994	 9.0	490065	154 mins.	Crime|Thriller
tt0137523	Fight Club (1999)	1999	 8.8	458173	139 mins.	Drama|Mystery|Thriller
tt0133093	The Matrix (1999)	1999	 8.7	448114	136 mins.	Action|Adventure|Sci-Fi
tt1375666	Inception (2010)	2010	 8.9	385149	148 mins.	Action|Adventure|Sci-Fi|Thriller
tt0109830	Forrest Gump (1994)	1994	 8.7	368994	142 mins.	Comedy|Drama|Romance
tt0169547	American Beauty (1999)	1999	 8.6	338332	122 mins.	Drama
tt0499549	Avatar (2009)	2009	 8.1	336855	162 mins.	Action|Adventure|Fantasy|Sci-Fi
tt0108052	Schindler's List (1993)	1993	 8.9	325888	195 mins.	Biography|Drama|History|War
tt0080684	Star Wars: Episode V - The Empire Strikes Back (1980)	1980	 8.8	320105	124 mins.	Action|Adventure|Family|Sci-Fi

1. Build a DataFrame

The text file is tab-separated and has no column headers. We set the appropriate keywords in pd.read_csv to handle this.

In [4]:
names = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
data = pd.read_csv('imdb_top_10000.txt', delimiter='\t', names=names).dropna()
print("Number of rows: %i" % data.shape[0])
data.head()  # print the first 5 rows
Number of rows: 9999
Out[4]:
   imdbID     title                            year  score  votes   runtime    genres
0  tt0111161  The Shawshank Redemption (1994)  1994  9.2    619479  142 mins.  Crime|Drama
1  tt0110912  Pulp Fiction (1994)              1994  9.0    490065  154 mins.  Crime|Thriller
2  tt0137523  Fight Club (1999)                1999  8.8    458173  139 mins.  Drama|Mystery|Thriller
3  tt0133093  The Matrix (1999)                1999  8.7    448114  136 mins.  Action|Adventure|Sci-Fi
4  tt1375666  Inception (2010)                 2010  8.9    385149  148 mins.  Action|Adventure|Sci-Fi|Thriller
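To try the same read_csv keywords without downloading the file, here is a minimal sketch that parses two of the rows shown above from an in-memory string. The io.StringIO wrapper and the variable names raw and sample are assumptions made for this illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Two sample rows in the same tab-separated, header-less layout as the file
raw = (
    "tt0111161\tThe Shawshank Redemption (1994)\t1994\t 9.2\t619479\t142 mins.\tCrime|Drama\n"
    "tt0110912\tPulp Fiction (1994)\t1994\t 9.0\t490065\t154 mins.\tCrime|Thriller\n"
)

names = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']

# sep='\t' splits on tabs; names= supplies the headers, so the first row stays data
sample = pd.read_csv(io.StringIO(raw), sep='\t', names=names)

print(sample.shape)
print(sample.dtypes)   # runtime and genres are still plain strings at this point
```

Inspecting dtypes right after loading is a quick way to spot columns, such as runtime here, that look numeric but were read as text.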

2. Clean the DataFrame

There are several problems with the DataFrame at this point:

  1. The runtime column describes a number, but is stored as a string
  2. The genres column is not atomic -- it aggregates several genres together. This makes it hard, for example, to extract which movies are Comedies.
  3. The movie year is repeated in the title and year column

Fixing the runtime column

The following snippet converts a string like '142 mins.' to the number 142:

In [6]:
dirty = '142 mins.'
number, text = dirty.split(' ')
clean = int(number)
print(clean)
142

We can package this up into a list comprehension

In [7]:
clean_runtime = [float(r.split(' ')[0]) for r in data.runtime]
data['runtime'] = clean_runtime
data.head()
Out[7]:
   imdbID     title                            year  score  votes   runtime  genres
0  tt0111161  The Shawshank Redemption (1994)  1994  9.2    619479  142      Crime|Drama
1  tt0110912  Pulp Fiction (1994)              1994  9.0    490065  154      Crime|Thriller
2  tt0137523  Fight Club (1999)                1999  8.8    458173  139      Drama|Mystery|Thriller
3  tt0133093  The Matrix (1999)                1999  8.7    448114  136      Action|Adventure|Sci-Fi
4  tt1375666  Inception (2010)                 2010  8.9    385149  148      Action|Adventure|Sci-Fi|Thriller
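The list comprehension works well here. As an alternative sketch, the same cleanup can be written with pandas' vectorized string methods, which also cope with entries that do not follow the 'NNN mins.' pattern. The sample Series below is made up for illustration.

```python
import pandas as pd

runtime = pd.Series(['142 mins.', '154 mins.', '0 mins.'])

# Pull out the leading digits, then convert them to numbers;
# errors='coerce' turns anything unparseable into NaN instead of raising
minutes = pd.to_numeric(runtime.str.extract(r'(\d+)', expand=False),
                        errors='coerce')

print(minutes.tolist())
```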

Splitting up the genres

We can use the concept of indicator variables to split the genres column into many columns. Each new column will correspond to a single genre, and each cell will be True or False.

In [8]:
#determine the unique genres
genres = set()
for m in data.genres:
    genres.update(g for g in m.split('|'))
genres = sorted(genres)

#make a column for each genre
for genre in genres:
    data[genre] = [genre in movie.split('|') for movie in data.genres]
         
data.head()
Out[8]:
imdbID title year score votes runtime genres Action Adult Adventure Animation Biography Comedy Crime Drama Family Fantasy Film-Noir History Horror Music Musical Mystery News Reality-TV Romance Sci-Fi Sport Thriller War Western
0 tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 Crime|Drama False False False False False False True True False False False False False False False False False False False False False False False False
1 tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 Crime|Thriller False False False False False False True False False False False False False False False False False False False False False True False False
2 tt0137523 Fight Club (1999) 1999 8.8 458173 139 Drama|Mystery|Thriller False False False False False False False True False False False False False False False True False False False False False True False False
3 tt0133093 The Matrix (1999) 1999 8.7 448114 136 Action|Adventure|Sci-Fi True False True False False False False False False False False False False False False False False False False True False False False False
4 tt1375666 Inception (2010) 2010 8.9 385149 148 Action|Adventure|Sci-Fi|Thriller True False True False False False False False False False False False False False False False False False False True False True False False
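The explicit loop makes the indicator-variable idea easy to follow. For comparison, a minimal sketch using the built-in Series.str.get_dummies, which performs the split and the column creation in one call (the three-row Series is a made-up sample):

```python
import pandas as pd

genres_col = pd.Series(['Crime|Drama', 'Crime|Thriller', 'Drama|Mystery|Thriller'])

# str.get_dummies splits each string on the separator and returns
# one indicator column per distinct value; astype(bool) gives True/False
indicators = genres_col.str.get_dummies(sep='|').astype(bool)

print(indicators)
```

The resulting frame could then be attached to the original table, for example with pd.concat along axis=1.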

Removing year from the title

We can fix each element by stripping off the last 7 characters

In [9]:
data['title'] = [t[0:-7] for t in data.title]
data.head()
Out[9]:
imdbID title year score votes runtime genres Action Adult Adventure Animation Biography Comedy Crime Drama Family Fantasy Film-Noir History Horror Music Musical Mystery News Reality-TV Romance Sci-Fi Sport Thriller War Western
0 tt0111161 The Shawshank Redemption 1994 9.2 619479 142 Crime|Drama False False False False False False True True False False False False False False False False False False False False False False False False
1 tt0110912 Pulp Fiction 1994 9.0 490065 154 Crime|Thriller False False False False False False True False False False False False False False False False False False False False False True False False
2 tt0137523 Fight Club 1999 8.8 458173 139 Drama|Mystery|Thriller False False False False False False False True False False False False False False False True False False False False False True False False
3 tt0133093 The Matrix 1999 8.7 448114 136 Action|Adventure|Sci-Fi True False True False False False False False False False False False False False False False False False False True False False False False
4 tt1375666 Inception 2010 8.9 385149 148 Action|Adventure|Sci-Fi|Thriller True False True False False False False False False False False False False False False False False False False True False True False False
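Slicing off a fixed 7 characters assumes every title ends in ' (YYYY)'. A slightly more defensive sketch uses a regular expression that removes the suffix only when it is present. The titles below are sample values, and 'Untitled Project' is a hypothetical entry without a year.

```python
import pandas as pd

titles = pd.Series(['The Matrix (1999)', 'Se7en (1995)', 'Untitled Project'])

# Remove a trailing " (YYYY)" only when it is actually there
cleaned = titles.str.replace(r'\s*\(\d{4}\)$', '', regex=True)

print(cleaned.tolist())
```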

3. Explore global properties

Next, we get a handle on some basic, global summaries of the DataFrame.

Call describe on relevant columns

In [10]:
data[['score', 'runtime', 'year', 'votes']].describe()
Out[10]:
             score      runtime         year          votes
count  9999.000000  9999.000000  9999.000000    9999.000000
mean      6.385989   103.580358  1993.471447   16605.462946
std       1.189965    26.629310    14.830049   34564.883945
min       1.500000     0.000000  1950.000000    1356.000000
25%       5.700000    93.000000  1986.000000    2334.500000
50%       6.600000   102.000000  1998.000000    4981.000000
75%       7.200000   115.000000  2005.000000   15278.500000
max       9.200000   450.000000  2011.000000  619479.000000
In [12]:
#hmmm, a runtime of 0 looks suspicious. How many movies have that?
print(len(data[data.runtime == 0]))

#probably best to flag those bad data as NaN
#(a single .loc assignment writes to data itself and avoids chained indexing)
data.loc[data.runtime == 0, 'runtime'] = np.nan
282

After flagging bad runtimes, we repeat

In [13]:
data.runtime.describe()
Out[13]:
count    9717.000000
mean      106.586395
std        20.230330
min        45.000000
25%        93.000000
50%       103.000000
75%       115.000000
max       450.000000
Name: runtime, dtype: float64
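Flagging values is safest as a single .loc assignment on the DataFrame itself, because chained indexing (selecting a column first and then assigning into a mask of it) may write to a temporary copy. A minimal sketch on a made-up four-row frame, also showing that count() skips the flagged rows just as describe() does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'runtime': [142.0, 0.0, 154.0, 0.0]})

# One label-based assignment: row mask and column name in the same .loc call
df.loc[df['runtime'] == 0, 'runtime'] = np.nan

print(df['runtime'].isna().sum())   # number of flagged rows
print(df['runtime'].count())        # count() ignores NaN
```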

Make some basic plots

In [14]:
# more movies in recent years, but not *very* recent movies (they haven't had time to receive lots of votes yet?)
plt.hist(data.year, bins=np.arange(1950, 2013), color='#cccccc')
plt.xlabel("Release Year")
remove_border()

The same with the ggplot package.

In [15]:
import ggplot
In [16]:
p = ggplot.ggplot(ggplot.aes(x='year'), data=data)
p + ggplot.geom_histogram(binwidth=1) + ggplot.ggtitle("Movies per year histogram")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[...]
/sw/python2/anaconda3/envs/python_course/lib/python3.4/site-packages/ggplot/stats/stat_bin.py in _calculate(self, data)
    125                             })
    126         _wfreq_table = pd.pivot_table(_df, values='weights',
--> 127                                       rows=['assignments'], aggfunc=np.sum)
    128 
    129         # For numerical x values, empty bins get have no value

TypeError: pivot_table() got an unexpected keyword argument 'rows'

The plot cannot be drawn with the library versions installed here: ggplot's stat_bin passes the keyword rows to pd.pivot_table, which this pandas version does not accept.
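The error comes from a renamed keyword: pandas' pivot_table takes index (and columns) where older releases took rows (and cols). The sketch below repeats the kind of call that fails inside ggplot, written with the current keyword. The small frame is made up, and only its column names mirror the ones visible in the traceback.

```python
import pandas as pd

df = pd.DataFrame({'assignments': [0, 0, 1, 2, 2, 2],
                   'weights':     [1, 1, 1, 1, 1, 1]})

# index= names the grouping column(s); older pandas releases called this rows=
freq = pd.pivot_table(df, values='weights', index=['assignments'], aggfunc='sum')

print(freq)
```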
In [17]:
plt.hist(data.score, bins=20, color='#cccccc')
plt.xlabel("IMDB rating")
remove_border()
In [18]:
plt.hist(data.runtime.dropna(), bins=50, color='#cccccc')
plt.xlabel("Runtime distribution")
remove_border()
In [19]:
#hmm, more bad, recent movies. Real, or a selection bias?

plt.scatter(data.year, data.score, lw=0, alpha=.08, color='k')
plt.xlabel("Year")
plt.ylabel("IMDB Rating")
remove_border()
In [20]:
plt.scatter(data.votes, data.score, lw=0, alpha=.2, color='k')
plt.xlabel("Number of Votes")
plt.ylabel("IMDB Rating")
plt.xscale('log')
remove_border()

Identify some outliers

In [21]:
# low-score movies with lots of votes
data[(data.votes > 9e4) & (data.score < 5)][['title', 'year', 'score', 'votes', 'genres']]
Out[21]:
     title           year  score  votes  genres
317  New Moon        2009  4.5    90457  Adventure|Drama|Fantasy|Romance
334  Batman & Robin  1997  3.5    91875  Action|Crime|Fantasy|Sci-Fi
In [22]:
# The lowest rated movies
data[data.score == data.score.min()][['title', 'year', 'score', 'votes', 'genres']]
Out[22]:
      title                         year  score  votes  genres
1982  Manos: The Hands of Fate      1966  1.5    20927  Horror
2793  Superbabies: Baby Geniuses 2  2004  1.5    13196  Comedy|Family
3746  Daniel the Wizard             2004  1.5    8271   Comedy|Crime|Family|Fantasy|Horror
5158  Ben & Arthur                  2002  1.5    4675   Drama|Romance
5993  Night Train to Mundo Fine     1966  1.5    3542   Action|Adventure|Crime|War
6257  Monster a-Go Go               1965  1.5    3255   Sci-Fi|Horror
6726  Dream Well                    2009  1.5    2848   Comedy|Romance|Sport
In [23]:
# The highest rated movies
data[data.score == data.score.max()][['title', 'year', 'score', 'votes', 'genres']]
Out[23]:
    title                     year  score  votes   genres
0   The Shawshank Redemption  1994  9.2    619479  Crime|Drama
26  The Godfather             1972  9.2    474189  Crime|Drama
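The outlier queries above all rely on boolean masks. A minimal sketch of the two patterns on a made-up four-row frame (titles A to D are placeholders):

```python
import pandas as pd

movies = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                       'score': [9.2, 1.5, 4.5, 9.2],
                       'votes': [600000, 20000, 90457, 470000]})

# Combine conditions with & (and) or | (or); each comparison needs its own parentheses
popular_but_bad = movies[(movies.votes > 9e4) & (movies.score < 5)]
print(popular_but_bad.title.tolist())

# Comparing against max() keeps every row tied for the top score
top = movies[movies.score == movies.score.max()]
print(top.title.tolist())
```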

Run aggregation functions like sum over several rows or columns

What genres are the most frequent?

In [24]:
#sum sums over rows by default
genre_count = np.sort(data[genres].sum())[::-1]
pd.DataFrame({'Genre Count': genre_count})
Out[24]:
Genre Count
0 5697
1 3922
2 2832
3 2441
4 1891
5 1867
6 1313
7 1215
8 1009
9 916
10 897
11 754
12 512
13 394
14 371
15 358
16 314
17 288
18 260
19 235
20 40
21 9
22 1
23 1
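np.sort returns a bare array, which is why the table above shows positions 0 to 23 instead of genre names. One way to keep the labels, sketched on a made-up three-column frame, is to sort the summed Series itself:

```python
import pandas as pd

flags = pd.DataFrame({'Comedy': [True, False, True],
                      'Drama':  [True, True, True],
                      'War':    [False, False, True]})

# sum() adds up each column (True counts as 1);
# sort_values keeps the column names as the index of the result
counts = flags.sum().sort_values(ascending=False)

print(counts)
```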

How many genres does a movie have, on average?

In [26]:
#axis=1 sums over columns instead
genre_count = data[genres].sum(axis=1) 
print("Average movie has %0.2f genres" % genre_count.mean())
genre_count.describe()
Average movie has 2.75 genres
Out[26]:
count    9999.000000
mean        2.753975
std         1.168910
min         1.000000
25%         2.000000
50%         3.000000
75%         3.000000
max         8.000000
dtype: float64

4. Explore group properties

Let's split up movies by decade

In [27]:
decade = (data.year // 10) * 10

#take an explicit copy, so that adding a column never touches (or warns about) data
tyd = data[['title', 'year']].copy()
tyd['decade'] = decade

tyd.head()
Out[27]:
   title                     year  decade
0  The Shawshank Redemption  1994  1990
1  Pulp Fiction              1994  1990
2  Fight Club                1999  1990
3  The Matrix                1999  1990
4  Inception                 2010  2010

GroupBy will gather movies into groups with equal decade values

In [29]:
#mean score for all movies in each decade
decade_mean = data.groupby(decade).score.mean()
decade_mean.name = 'Decade Mean'
print(decade_mean)

plt.plot(decade_mean.index, decade_mean.values, 'o-',
        color='r', lw=3, label='Decade Average')
plt.scatter(data.year, data.score, alpha=.04, lw=0, color='k')
plt.xlabel("Year")
plt.ylabel("Score")
plt.legend(frameon=False)
remove_border()
year
1950    7.244522
1960    7.062367
1970    6.842297
1980    6.248693
1990    6.199316
2000    6.277858
2010    6.344552
Name: Decade Mean, dtype: float64

We can go one step further, and compute the scatter in each decade as well

In [30]:
grouped_scores = data.groupby(decade).score

mean = grouped_scores.mean()
std = grouped_scores.std()

plt.plot(mean.index, mean.values, 'o-',
        color='r', lw=3, label='Decade Average')
plt.fill_between(mean.index, (mean + std).values,
                 (mean - std).values, color='r', alpha=.2)
plt.scatter(data.year, data.score, alpha=.04, lw=0, color='k')
plt.xlabel("Year")
plt.ylabel("Score")
plt.legend(frameon=False)
remove_border()
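The mean and standard deviation per decade can also be computed in one call with agg. A minimal sketch on made-up numbers; note that a group with a single member has no sample standard deviation, so its std is NaN.

```python
import pandas as pd

df = pd.DataFrame({'year':  [1994, 1999, 2003, 2008, 2010],
                   'score': [9.2, 8.8, 8.9, 9.0, 8.8]})

decade = (df.year // 10) * 10          # 1994 -> 1990, 2008 -> 2000

# One row per decade, one column per aggregation function
stats = df.groupby(decade).score.agg(['mean', 'std'])

print(stats)
```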

You can also iterate over a GroupBy object. Each iteration yields two variables: one of the distinct values of the group key, and the subset of the dataframe where the key equals that value. To find the highest-rated movie (or movies, in case of a tie) each year:

In [32]:
for year, subset in data.groupby('year'):
    print(year, subset[subset.score == subset.score.max()].title.values)
1950 ['Sunset Blvd.']
1951 ['Strangers on a Train']
1952 ["Singin' in the Rain"]
1953 ['The Wages of Fear' 'Tokyo Story']
1954 ['Seven Samurai']
1955 ['Diabolique']
1956 ['The Killing']
1957 ['12 Angry Men']
1958 ['Vertigo']
1959 ['North by Northwest']
1960 ['Psycho']
1961 ['Yojimbo']
1962 ['To Kill a Mockingbird' 'Lawrence of Arabia']
1963 ['The Great Escape' 'High and Low']
1964 ['Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb']
1965 ['For a Few Dollars More']
1966 ['The Good, the Bad and the Ugly']
1967 ['Cool Hand Luke']
1968 ['Once Upon a Time in the West']
1969 ['Butch Cassidy and the Sundance Kid' 'Army of Shadows']
1970 ['Patton' 'The Conformist' 'Le Cercle Rouge']
1971 ['A Clockwork Orange']
1972 ['The Godfather']
1973 ['The Sting' 'Scenes from a Marriage']
1974 ['The Godfather: Part II']
1975 ['Outrageous Class']
1976 ['Tosun Pasa']
1977 ['Star Wars: Episode IV - A New Hope']
1978 ['The Girl with the Red Scarf']
1979 ['Apocalypse Now']
1980 ['Star Wars: Episode V - The Empire Strikes Back']
1981 ['Raiders of the Lost Ark']
1982 ['The Marathon Family']
1983 ['Star Wars: Episode VI - Return of the Jedi']
1984 ['Balkan Spy']
1985 ['The Broken Landlord']
1986 ['Aliens']
1987 ['Mr. Muhsin']
1988 ['Cinema Paradiso']
1989 ['Indiana Jones and the Last Crusade' "Don't Let Them Shoot the Kite"]
1990 ['Goodfellas']
1991 ['The Silence of the Lambs']
1992 ['Reservoir Dogs']
1993 ["Schindler's List"]
1994 ['The Shawshank Redemption']
1995 ['The Usual Suspects' 'Se7en']
1996 ['Fargo' 'The Bandit']
1997 ['Life Is Beautiful']
1998 ['American History X']
1999 ['Fight Club']
2000 ['Memento']
2001 ['The Lord of the Rings: The Fellowship of the Ring']
2002 ['City of God']
2003 ['The Lord of the Rings: The Return of the King']
2004 ['Eternal Sunshine of the Spotless Mind']
2005 ['My Father and My Son']
2006 ['The Departed' 'The Lives of Others']
2007 ['Like Stars on Earth']
2008 ['The Dark Knight']
2009 ['Inglourious Basterds']
2010 ['Inception']
2011 ['A Separation']
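When only one winner per group is needed, the loop can be replaced by idxmax, sketched here on a made-up frame. Unlike the loop above, this keeps only the first row in case of a tie.

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                   'year':  [1994, 1994, 1999, 1999],
                   'score': [9.2, 9.0, 8.8, 8.7]})

# idxmax returns, for each group, the index label of the first row with the highest score
best = df.loc[df.groupby('year').score.idxmax(), ['year', 'title', 'score']]

print(best)
```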

Small multiples

Let's split up the movies by genre, and look at how their release year/runtime/IMDB score vary. The distribution for all movies is shown as a grey background.

This isn't a standard groupby, because a movie can belong to several genres at once, so we can't use the groupby method here. A manual loop is needed.

In [33]:
#create a 4x6 grid of plots.
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(12, 8), 
                         tight_layout=True)

bins = np.arange(1950, 2013, 3)
for ax, genre in zip(axes.ravel(), genres):
    #pass plain arrays (.values): this matplotlib version indexes x[0], which raises
    #KeyError: 0 on a filtered Series whose index no longer contains the label 0
    #(on matplotlib >= 3.1, use density=True in place of normed=True)
    ax.hist(data[data[genre] == 1].year.values, 
            bins=bins, histtype='stepfilled', normed=True, color='r', alpha=.3, ec='none')
    ax.hist(data.year.values, bins=bins, histtype='stepfilled', ec='none', normed=True, zorder=0, color='#cccccc')
    
    ax.annotate(genre, xy=(1955, 3e-2), fontsize=14)
    ax.xaxis.set_ticks(np.arange(1950, 2013, 30))
    ax.set_yticks([])
    remove_border(ax, left=False)
    ax.set_xlabel('Year')

Some subtler patterns here:

  1. Westerns and Musicals have a more level distribution
  2. Film Noir movies were much more popular in the 50s and 60s
In [34]:
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(12, 8), tight_layout=True)

bins = np.arange(30, 240, 10)

for ax, genre in zip(axes.ravel(), genres):
    #drop the NaN runtimes flagged earlier and pass plain arrays (.values): this
    #matplotlib version indexes x[0], which raises KeyError: 0 on a filtered Series
    #(on matplotlib >= 3.1, use density=True in place of normed=True)
    ax.hist(data[data[genre] == 1].runtime.dropna().values, 
            bins=bins, histtype='stepfilled', color='r', ec='none', alpha=.3, normed=True)
               
    ax.hist(data.runtime.dropna().values, bins=bins, normed=True,
            histtype='stepfilled', ec='none', color='#cccccc',
            zorder=0)
    
    ax.set_xticks(np.arange(30, 240, 60))
    ax.set_yticks([])
    ax.set_xlabel("Runtime [min]")
    remove_border(ax, left=False)
    ax.annotate(genre, xy=(230, .02), ha='right', fontsize=12)

  1. Biographies and history movies are longer
  2. Animated movies are shorter
  3. Film-Noir movies have the same mean, but are more concentrated around a 100-minute runtime
  4. Musicals have the same mean, but greater dispersion in runtimes
In [35]:
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(12, 8), tight_layout=True)

bins = np.arange(0, 10, .5)

for ax, genre in zip(axes.ravel(), genres):
    #pass plain arrays (.values): this matplotlib version indexes x[0], which raises
    #KeyError: 0 on a filtered Series whose index no longer contains the label 0
    #(on matplotlib >= 3.1, use density=True in place of normed=True)
    ax.hist(data[data[genre] == 1].score.values, 
            bins=bins, histtype='stepfilled', color='r', ec='none', alpha=.3, normed=True)
               
    ax.hist(data.score.values, bins=bins, normed=True,
            histtype='stepfilled', ec='none', color='#cccccc',
            zorder=0)
    
    ax.set_yticks([])
    ax.set_xlabel("Score")
    remove_border(ax, left=False)
    ax.set_ylim(0, .4)
    ax.annotate(genre, xy=(0, .2), ha='left', fontsize=12)

  1. Film-noirs, histories, and biographies have higher ratings (a selection effect?)
  2. Horror movies and adult films have lower ratings
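The small-multiples pattern used in this section can be tried in isolation with the sketch below. It runs on synthetic data (the scores and genre flags are random numbers, not IMDB values), uses the off-screen Agg backend so that no display is needed, and assumes a matplotlib release that accepts the density keyword.

```python
import matplotlib
matplotlib.use('Agg')   # off-screen backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the movie table (values are made up for illustration)
rng = np.random.RandomState(0)
df = pd.DataFrame({'score': rng.normal(6.4, 1.2, 500).clip(1, 10),
                   'Drama': rng.rand(500) < 0.5,
                   'Horror': rng.rand(500) < 0.2})
genre_cols = ['Drama', 'Horror']

bins = np.arange(0, 10.5, .5)
fig, axes = plt.subplots(nrows=1, ncols=len(genre_cols), figsize=(6, 3))

for ax, genre in zip(np.ravel(axes), genre_cols):
    # grey background: all movies; red overlay: movies flagged with this genre
    ax.hist(df.score.values, bins=bins, density=True, color='#cccccc', zorder=0)
    ax.hist(df.loc[df[genre], 'score'].values, bins=bins, density=True,
            color='r', alpha=.3)
    ax.set_title(genre)
    ax.set_yticks([])
```

Drawing the full distribution behind each subset gives every panel the same reference, which is what makes differences between genres easy to spot.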
