>>> pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
2015 Sales 30
2016 Sales 35
2017 Sales 40
Name: Product A, dtype: int64
>>> reviews=pd.read_csv('/mnt/d/wine.csv')
>>> reviews
show the contents of file wine.csv
>>> reviews.country
>>> reviews['country'] to see only country column
>>> reviews.iloc[0] return only one row
>>> review.iloc[5:10,[3,5]]]
>>> review.loc[review.country=='Spain']Out[16]:
| country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
---|
6 | Italy | Here's a bright, informal red that opens with ... | Belsito | 87 | 16.0 | Sicily & Sardinia | Vittoria | NaN | Kerin O’Keefe | @kerinokeefe | Terre di Giurfo 2013 Belsito Frappato (Vittoria) | Frappato | Terre di Giurfo |
---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
---|
129961 | Italy |
---|
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
Note take care of Brackets like [(a==1) & (b==2) ]
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
Out[19]:
| country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
---|
6 | Italy | Here's a bright, informal red that opens with ... | Belsito | 87 | 16.0 | Sicily & Sardinia | Vittoria | NaN | Kerin O’Keefe | @kerinokeefe | Terre di Giurfo 2013 Belsito Frappato (Vittoria) | Frappato | Terre di Giurfo |
---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
---|
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
---|
129970 | France | Big, rich and off-dry, this is powered by inte.. |
---|
Assigning data
Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:
Out[21]:
0 everyone
1 everyone
...
129969 everyone
129970 everyone
Make new column Disc = price*.9
review['disc']=review.price*.9
review[['price','disc']]
After doing modification in review file you may save it to another other file or same with command as under
>>> review.to_csv('my.csv') this will create new file name my.csv
Other overwrite file with same name
>>> review.to_csv('wine.csv')
df = pd.DataFrame({'country':['Italy','Portugal','US','US'],\
'provice':['Sicily & Sardina','Douro','California','New York'],\
'region_1':['Enta','NaN','Napa Valley','Finger Lakes'],\
'region_2':['NaN','NaN','Napa','Finger Lakes']},index=[0,1,10,100])
df =reviews.loc[[0,1,10,100],\
['country','province','region_1','region_2']]
| country | province | region_1 | region_2 |
---|
0 | Italy | Sicily & Sardinia | Etna | NaN |
---|
1 | Portugal | Douro | NaN | NaN |
---|
10 | US | California | Napa Valley | Napa |
---|
100 | US | New York | Finger Lakes | Finger Lakes |
---|
df = reviews.loc[0:99,['country','variety']]
>>> r.country.value_counts()
Summary Functions and Maps
>>> reviews.points.describe()
Maps
A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!
=================================
def remean_points(row):
row.points = row.points - review_points_mean
return row
reviews.apply(remean_points, axis='columns')
axis is either 0 or 1
1 means 'columns'
axis = 1 is same as axis='columns'
axis = 0 is ame as axis='rows'
row = each row of review
Always first parameter is taken from commad line like
review.points.apply(function)
Here the value of each points is row
row = points (value)
================
def func1(row):
row.country='PAKISTAN ZINDAHBAD'
row.variety='BAKWAS'
row.winery='HARAM FILL ISLAM'
return row
>>> reviews.apply(func1,axis='columns')
OR
>>>reviews.apply(func1, axis=1)
-----------------------Exersice No5--
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine
with the title of the wine with the highest points-to-price ratio in the dataset.
Hints to use idxmax() it will return record of Max value
>>> bargain_wine = reviews.loc[(reviews.points/reviews.price) .idxmax(), 'title']
In other simple way by using two lines
Solution:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
-------------------------Ex 6
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts
counting how many times each of these two words appears in the description
column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)
Hint: Use a map to check each description for the string tropical
, then count up the number of times this is True
. Repeat this for fruity
. Finally, create a Series
combining the two values.
Solution:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
output
tropical 3607
fruity 9090
dtype: int64
-------------------------------Ex 7---------------------------------------------
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.
Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.
Create a series star_ratings
with the number of stars corresponding to each review in the dataset.
Hint: Begin by writing a custom function that accepts a row from the DataFrame as input and returns the star rating corresponding to the row. Then, use DataFrame.apply to apply the custom function to every row in the dataset.
def stars(row):
if row.country == 'Canada':
return 3
if row.points >= 95:
return 3
if row.points >= 85:
return 2
return 1
star_ratings = reviews.apply(stars,axis=1)
axis=1 on columns and axis=0 means processed on ROWs
Grouping and Sorting
groupby() operation
points
80 397
81 692
...
99 33
100 19
Name: points, Length: 21, dtype: int64
-------------------------------------------
how we would pick out the best wine by country and province:
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:
reviews.groupby(['country']).price.agg([len, min, max])
len min max
country
Argentina 3800 4.0 230.0
Armenia 2 14.0 15.0
... ... ... ...
Ukraine 14 6.0 13.0
Uruguay 109 10.0 130.0
43 rows × 3 columns
Effective use of groupby() will allow you to do lots of really powerful things with your dataset.
Multiple Index
Renaming and Combining
>>> reviews.rename(columns={'points': 'score'})
country | description | designation | score | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
---|
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
---|
>>> r.rename(columns={'points':'scrore','country':'mulk'})
>>> r.rename(index={0:'ONE',2:'THREE'})
Out[4]:
fields | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|
wines | | | | | | | | | | | | | |
---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN |
---|
pd.concat command to concate two files
>>> r=pd.read_csv('eg.csv')
>>> r1=pd.read_csv('eg2.csv')
>>> pd.concat([r,r2]
out put list r listing then with same colulum r1 listing
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
======================================
powerlifting_meets = pd.read_csv("../input/powerlifting-database/meets.csv")
powerlifting_competitors = pd.read_csv("../input/powerlifting-database/openpowerlifting.csv")
Both tables include references to a MeetID
, a unique key for each meet (competition) included in the database. Using this, generate a dataset combining the two tables into one.
Solution:
powerlifting_combined = powerlifting_meets.set_index("MeetID").join(powerlifting_competitors.set_index("MeetID"))
\
Comments
Post a Comment