1  rails g model personal_info height:float weight:float person:references 
db/migrate/_create_personal_infos.rb
1  class CreatePersonalInfos < ActiveRecord::Migration[6.0] 
1  rake db:migrate 
app/models/person.rb
1  class Person < ApplicationRecord 
app/models/personal_info.rb
1  class PersonalInfo < ApplicationRecord 
With this association in place, you get build_personal_info(hash) and create_personal_info(hash) methods on a Person instance.
1  zhang = Person.first 
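These association helpers are generated for you at class-definition time. As a rough sketch of the idea (plain Ruby, no Rails; `ToyAssociations`, `has_one_toy` and the stub classes are made up for illustration), a macro can use `define_method` to create a `build_*` helper:

```ruby
module ToyAssociations
  # has_one_toy is a made-up macro; Rails' real has_one generates
  # build_/create_ helpers in a similar (much richer) way.
  def has_one_toy(name, klass)
    define_method("build_#{name}") do |attrs = {}|
      klass.new(attrs)
    end
  end
end

class PersonalInfoStub
  attr_reader :attrs
  def initialize(attrs)
    @attrs = attrs
  end
end

class PersonStub
  extend ToyAssociations
  has_one_toy :personal_info, PersonalInfoStub
end

info = PersonStub.new.build_personal_info(height: 180.0, weight: 75.0)
puts info.attrs[:height]   # => 180.0
```

The real Rails helpers additionally wire up the foreign key and persistence, but the mechanism (a class-level macro defining instance methods) is the same.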
1  rails g model job title company position_id person:references 
db/migrate/_create_jobs.rb
1  class CreateJobs < ActiveRecord::Migration[6.0] 
1  rake db:migrate 
app/models/person.rb
1  class Person < ApplicationRecord 
app/models/job.rb
1  class Job < ApplicationRecord 
1  rails g model hobby name 
db/migrate/_create_hobbies_people.rb
1  class CreateHobbiesPeople < ActiveRecord::Migration[6.0] 
1  rake db:migrate 
app/models/person.rb
1  class Person < ApplicationRecord 
app/models/hobby.rb
1  class Hobby < ApplicationRecord 
:through
Rails provides the `:through` option for this purpose: set up a parent-child relationship, then use the child model as a join between the parent and the grandchild.
1  rails g model salary_range min_salary:float max_salary:float job:references 
db/migrate/_create_salary_ranges.rb
1  class CreateSalaryRanges < ActiveRecord::Migration[6.0] 
1  rake db:migrate 
app/models/salary_range.rb
1  class SalaryRange < ApplicationRecord 
app/models/job.rb
1  class Job < ApplicationRecord 
app/models/person.rb
class Person < ApplicationRecord
  has_one :personal_info
  has_many :jobs
  has_and_belongs_to_many :hobbies
  has_many :approx_salaries, through: :jobs, source: :salary_range

  def max_salary
    approx_salaries.maximum(:max_salary)
  end
end
# Average, minimum and sum also available...
1  lebron = Person.find_by last_name: "James" 
ORM (Object-Relational Mapping): Bridges the gap between relational databases, which are designed around mathematical set theory, and object-oriented programming languages, which deal with objects and their behavior. Greatly simplifies writing code for accessing the database.
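As a conceptual sketch of what an ORM does (plain Ruby, not Rails internals; `ToyRecord` and the hard-coded row hash are made up for illustration), a row fetched from the database can be wrapped in an object whose accessors are generated from the column names:

```ruby
class ToyRecord
  def initialize(row)
    @row = row
    # generate a reader and a writer per column: "columns become attributes"
    row.each_key do |col|
      define_singleton_method(col) { @row[col] }
      define_singleton_method("#{col}=") { |value| @row[col] = value }
    end
  end
end

row = { id: 1, name: "zhang", email: "zrc720@gmail.com" } # imagine: SELECT * FROM people WHERE id = 1
p1 = ToyRecord.new(row)
puts p1.name        # => zhang
p1.name = "ruochi"
puts p1.name        # => ruochi
```

ActiveRecord does essentially this (plus SQL generation, type casting, dirty tracking, and persistence) by reading the column names from the schema.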
Convention: a database table has a plural name that corresponds to a model class with a singular name, and every table has an id primary key column.
Use an empty constructor and (ghost) attributes to set the values and then call save.
1  p1 = Person.new 
Pass a hash of attributes into the constructor and then call save.
1  p2 = Person.new(name: "zhang", email: "zrc720@gmail.com") 
Use create method with a hash to create an object and save it to the database in one step.
1  p3 = Person.create(name: "zhang", email: "zrc720@gmail.com") 
find(id)
or find(id1, id2)
Throws a RecordNotFound exception if not found
1  Person.find(1) 
first
, last
, take
, all
Return the results you expect or nil if nothing is found
1  Person.first 
order(:column)
or order(column: :desc)
Allows ordering of the results. Ascending or descending
1  Person.all.order(first_name: :desc) 
pluck
Use pluck as a shortcut to select one or more attributes without loading a bunch of records just to grab the attributes you want.
1  Person.pluck(:id, :name) 
take
Gives a record (or N records if a parameter is supplied) without any implied order. The order will depend on the database implementation. If an order is supplied it will be respected.
1  Person.take # returns an object fetched by SELECT * FROM people LIMIT 1 
where(hash)
Enables you to supply conditions for your search
1  Person.where(name: "zhangruochi") 
limit
Enables you to limit how many records come back
1  Person.offset(1).limit(1) 
offset(n)
Don’t start from the beginning; skip a few
1  Person.offset(1).limit(1) 
Retrieve a record, modify the values and then call save
1  zhang = User.where(name: "zhangruochi") 
Retrieve a record and then call update method passing in a hash of attributes with new values
1  zhang = User.where(name: "zhangruochi") 
There is also update_all
for batch updates
1  User.where(email: 'zrc720@gmail.com').update_all(name: 'ruochi') 
1  zhang = User.first 
send
You can call any method dynamically by passing its name (as a symbol or string) to the send method:
1  class Dog 
define_method :method_name
You can also define methods dynamically with define_method :method_name and a block, which becomes the method body:
1  class Whatever 
1  require_relative 'store' 
1  class Mystery 
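A self-contained sketch of these three techniques together (the class and method names here, `DynamicDog`, `sit`, `roll_over`, are made up for this example, not taken from the listings above):

```ruby
class DynamicDog
  def bark
    "Woof!"
  end

  # define_method with a block: the block becomes the method body
  [:sit, :stay].each do |trick|
    define_method(trick) { "#{trick} performed" }
  end

  # Ghost methods: any undefined call is handled here at call time
  def method_missing(name, *args)
    "no trick called #{name}"
  end

  def respond_to_missing?(name, include_private = false)
    true
  end
end

dog = DynamicDog.new
puts dog.send(:bark)   # => Woof!
puts dog.sit           # => sit performed
puts dog.roll_over     # => no trick called roll_over
```

Note that whenever you implement method_missing, you should also implement respond_to_missing? so that respond_to? stays truthful.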
Struct
: Generator of specific classes, each of which is defined to hold a set of variables and their accessors ("dynamic methods").
OpenStruct
: Object (similar to Struct) whose attributes are created dynamically when first assigned ("ghost methods").
1  Customer = Struct.new(:name, :address) do # block is optional 
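A complete, runnable contrast of the two (the `Point` and `settings` names are made up for this sketch; the Customer listing above is the same pattern):

```ruby
require 'ostruct'

# Struct: generates a class holding a fixed set of attributes
Point = Struct.new(:x, :y) do   # the optional block adds behavior
  def distance_from_origin
    Math.sqrt(x**2 + y**2)
  end
end

pt = Point.new(3, 4)
puts pt.x                        # => 3
puts pt.distance_from_origin     # => 5.0

# OpenStruct: attributes spring into existence on first assignment
settings = OpenStruct.new
settings.depth = 10              # no depth attribute existed before this line
puts settings.depth              # => 10
```

Struct is the right choice when the attribute set is known up front (it is faster and catches typos); OpenStruct trades that safety for flexibility via ghost methods.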
config/routes.rb
1  Rails.application.routes.draw do 
app/views/users/show.html.erb
1  <% provide(:title, @user.name) %> 
app/assets/stylesheets/custom.scss
1  /* sidebar */ 
app/helpers/users_helper.rb
1  module UsersHelper 
app/views/users/new.html.erb
1  <% provide(:title, 'Sign up') %> 
app/assets/stylesheets/custom.scss
1  /* forms */ 
Note that the partial rendered in the code above is named 'shared/error_messages'. This follows a Rails convention: if a partial is used by more than one controller (Section 10.1.1), it goes in the dedicated shared/ directory.
app/views/shared/_error_messages.html.erb
1  <% if @user.errors.any? %> 
app/controllers/users_controller.rb
1  class UsersController < ApplicationController . 
app/views/layouts/application.html.erb

config/environments/production.rb
1  Rails.application.configure do . 
config/puma.rb
1  # Puma configuration file. 
Finally, we create a Procfile that tells Heroku to run a Puma process in production; its contents are shown in Listing 7.36. Like the Gemfile, the Procfile belongs in the application's root directory.
./Procfile
1  web: bundle exec puma -C config/puma.rb 
config/database.yml
1  default: 
1  rails g model User name:string email:string 
app/models/user.rb
1  class User < ApplicationRecord 
1  class User < ApplicationRecord 
1  class User < ApplicationRecord 
1  class User < ApplicationRecord 
rails generate migration add_index_to_users_email
1  class AddIndexToUsersEmail < ActiveRecord::Migration[6.0] 
1  rails db:migrate 
Ensure that email addresses are stored in the database in all-lowercase form
1  class User < ApplicationRecord 
1  class User < ApplicationRecord 
Calling this method in the model automatically adds the following functionality:
The only requirement for has_secure_password to work is that the corresponding model has an attribute named password_digest. So we create an appropriate migration that adds a password_digest column.
1  rails generate migration add_password_digest_to_users password_digest:string 
1  class AddPasswordDigestToUsers < ActiveRecord::Migration[6.0] 
1  rails db:migrate 
The has_secure_password method computes the password digest with the state-of-the-art bcrypt hash function. With bcrypt, an attacker who manages to obtain a copy of the database still cannot log in to the site. We need to add the bcrypt gem to the Gemfile.
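The digest-and-compare pattern behind has_secure_password can be sketched in plain Ruby. This is a conceptual toy only: real Rails uses bcrypt (salted, deliberately slow); SHA-256 and the `ToyUser` class stand in here just so the example runs without any gems installed.

```ruby
require 'digest'

class ToyUser
  attr_reader :password_digest

  def password=(plain_text)
    # store only the digest, never the plain-text password
    @password_digest = Digest::SHA256.hexdigest(plain_text)
  end

  # mirrors has_secure_password's API: returns the user on success, false otherwise
  def authenticate(plain_text)
    Digest::SHA256.hexdigest(plain_text) == @password_digest && self
  end
end

user = ToyUser.new
user.password = "foobar"
puts user.authenticate("foobar") ? "ok" : "denied"   # => ok
puts user.authenticate("wrong")  ? "ok" : "denied"   # => denied
```

Unlike this sketch, bcrypt embeds a per-password salt and a tunable work factor in the digest, which is what makes stolen database dumps impractical to crack.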
1  rails generate controller StaticPages home help 
test/controllers/static_pages_controller_test.rb
1  class StaticPagesControllerTest < ActionDispatch::IntegrationTest 
app/views/static_pages/home.html.erb
The <% ... %> syntax calls the provide function supplied by Rails, assigning the string "Home" to :title. Then, in the title, we use the similar <%= ... %> syntax and Ruby's yield function to insert the title into the template.
1  <% provide(:title, "Home") %> 
To factor out this shared structure, Rails provides a special layout file named application.html.erb.
app/views/layouts/application.html.erb

These lines include the application's stylesheet and JavaScript files. The csp_meta_tag method provided by Rails implements a Content Security Policy (CSP) to guard against cross-site scripting (XSS) attacks, and the csrf_meta_tags method guards against cross-site request forgery (CSRF) attacks.
app/views/static_pages/home.html.erb
1  <% provide(:title, "Home") %> 
<%= yield %>
This line inserts each page's contents into the layout. With it in place, visiting /static_pages/home converts the contents of home.html.erb into HTML and inserts it where <%= yield %> appears.
1  rails new toy_app 
1  ... 
1  # Pass the --without production option when installing gems, to skip gems used only in production 
1  git init 
app/controllers/application_controller.rb
1  class ApplicationController < ActionController::Base 
config/routes.rb
1  Rails.application.routes.draw do 
1  rails generate scaffold User name:string email:string 
config/routes.rb
1  Rails.application.routes.draw do 
app/models/micropost.rb
1  class Micropost < ApplicationRecord 
1  rails db:migrate 
heroku
1  git add -A 
Define the source of your inputs (your X values).
1  ImageItemList.from_folder(path) 
Define how you want to split your inputs into training and validation datasets using one of the built-in mechanisms for doing so.
1  ImageItemList.from_folder(path) 
Define the source of your targets (that is, your y values) and combine them with the inputs of your training and validation datasets in the form of fastai LabelList objects. LabelList subclasses the PyTorch Dataset class.
1  ImageItemList.from_folder(path) 
Add a test dataset (optional).
1  data = (ImageItemList.from_folder(path) 
Add transforms to your LabelList objects (optional). Here you can apply data augmentation to either, or both, your inputs and targets.
1  data = (ImageItemList.from_folder(path) 
Build PyTorch DataLoaders from the Datasets defined above and package them up into a fastai DataBunch.
1  data = (ImageItemList.from_folder(path) 
Once this is done, you’ll have everything you need to train, validate, and test any PyTorch nn.Module using the fastai library. You’ll also have everything you need to later do inference on future data.
1  class ImageTuple(ItemBase): 
1  class TargetTupleList(ItemList): 
_bunch
contains the name of the class that will be used to create a DataBunch.
_processor
contains a class (or a list of classes) of PreProcessor that will then be used as the default to create a processor for this ItemList.
_label_cls
contains the class that will be used to create the labels by default.
1  class ImageTupleList(ImageList): 
by: Francisco Ingham and Jeremy Howard. Inspired by Adrian Rosebrock
1  from fastai.vision import * 
Go to Google Images and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.
Scroll down until you’ve seen all the images you want to download, or until you see a button that says ‘Show more results’. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.
It is a good idea to put things you want to exclude into the search query, for instance if you are searching for the Eurasian wolf, “canis lupus lupus”, it might be a good idea to exclude other variants:
"canis lupus lupus" dog arctos familiaris baileyi occidentalis
You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.
Now you must run some JavaScript code in your browser which will save the URLs of all the images you want for your dataset.
In Google Chrome press Ctrl-Shift-J on Windows/Linux or Cmd-Opt-J on macOS, and a small window, the JavaScript 'Console', will appear. In Firefox press Ctrl-Shift-K on Windows/Linux or Cmd-Opt-K on macOS. That is where you will paste the JavaScript commands.
You will need to get the urls of each of the images. Before running the following commands, you may want to disable ad blocking extensions (uBlock, AdBlockPlus etc.) in Chrome. Otherwise the window.open() command doesn’t work. Then you can run the following commands:
1  urls=Array.from(document.querySelectorAll('.rg_i')).map(el => el.hasAttribute('data-src') ? el.getAttribute('data-src') : el.getAttribute('data-iurl')); 
Choose an appropriate name for your labeled images. You can run these steps multiple times to create different labels.
1  help(download_images) 
Help on function download_images in module fastai.vision.data:

download_images(urls:Collection[str], dest:Union[pathlib.Path, str], max_pics:int=1000, max_workers:int=8, timeout=4)
    Download images listed in text file `urls` to path `dest`, at most `max_pics`
1  path = Path('data/dogs') 
1  folder = 'akita' 
1  download_images(urls=urls, dest=dest, max_pics=200) 
Error Invalid URL '': No schema supplied. Perhaps you meant http://?
(the same error is repeated once per blank line in the URL file; repeats omitted)
1  folder = 'husky' 
Error Invalid URL '': No schema supplied. Perhaps you meant http://? (repeats omitted)
1  folder = 'shibaInu' 
Error Invalid URL '': No schema supplied. Perhaps you meant http://? (repeats omitted)
1  folder = 'alaska' 
Error Invalid URL '': No schema supplied. Perhaps you meant http://? (repeats omitted)
1  help(DataBunch) 
Help on class DataBunch in module fastai.basic_data:

class DataBunch(builtins.object)
 |  Bind `train_dl`, `valid_dl` and `test_dl` in a data object.
 |
 |  Methods defined here (abridged):
 |  __init__(self, train_dl:DataLoader, valid_dl:DataLoader, fix_dl:DataLoader=None, test_dl:DataLoader=None, device:torch.device=None, dl_tfms=None, path='.', collate_fn=data_collate, no_check:bool=False)
 |  add_test(self, items:Iterator, label:Any=None, tfms=None, tfm_y=None) -> None
 |      Add the `items` as a test set. Pass along `label` otherwise label them with `EmptyLabel`.
 |  dl(self, ds_type:DatasetType=DatasetType.Valid) -> DeviceDataLoader
 |      Returns appropriate `Dataset` for validation, training, or test (`ds_type`).
 |  export(self, file='export.pkl')
 |      Export the minimal state of `self` for inference in `self.path/file`.
 |  one_batch(self, ds_type:DatasetType=DatasetType.Train, detach:bool=True, denorm:bool=True, cpu:bool=True) -> Collection[torch.Tensor]
 |      Get one batch from the data loader of `ds_type`. Optionally `detach` and `denorm`.
 |  one_item(self, item, detach:bool=False, denorm:bool=False, cpu:bool=False)
 |      Get `item` into a batch. Optionally `detach` and `denorm`.
 |  save(self, file='data_save.pkl') -> None
 |      Save the `DataBunch` in `self.path/file`.
 |  show_batch(self, rows:int=5, ds_type:DatasetType=DatasetType.Train, reverse:bool=False, **kwargs) -> None
 |      Show a batch of data in `ds_type` on a few `rows`.
 |
 |  Class methods: create(train_ds, valid_ds, test_ds=None, path='.', bs:int=64, val_bs:int=None, num_workers:int=6, ...) -> 'DataBunch'; load_empty(path, fname='export.pkl')
 |
 |  Data descriptors: batch_size, dls, empty_val, fix_ds, is_empty, loss_func, single_ds, test_ds, train_ds, valid_ds
(full help output abridged)
1  help(ImageDataBunch.from_folder) 
Help on method from_folder in module fastai.vision.data:

from_folder(path:Union[pathlib.Path, str], train='train', valid='valid', test=None, valid_pct=None, seed:int=None, classes:Collection=None, **kwargs:Any) -> 'ImageDataBunch' method of builtins.type instance
    Create from imagenet style dataset in `path` with `train`,`valid`,`test` subfolders (or provide `valid_pct`).
1  np.random.seed(7) 
1  data.show_batch(rows = 5) 
1  data.classes,data.c, len(data.train_ds), len(data.valid_ds) 
(['akita', 'alaska', 'husky', 'shibaInu'], 4, 512, 128)
1  learn = cnn_learner(data,models.resnet34, metrics = error_rate) 
1  learn.fit_one_cycle(4) 
epoch  train_loss  valid_loss  error_rate  time 

0  1.609119  0.857839  0.328125  00:05 
1  1.124259  0.611583  0.250000  00:05 
2  0.894434  0.618738  0.218750  00:05 
3  0.738053  0.625707  0.242188  00:05 
1  lean.load("stage1") 
Learner(data=ImageDataBunch;
  Train: LabelList (512 items), x: ImageList of Image (3, 224, 224), y: CategoryList, Path: data/dogs;
  Valid: LabelList (128 items), x: ImageList of Image (3, 224, 224), y: CategoryList, Path: data/dogs;
  Test: None,
  model=Sequential(
    (0): Sequential(...)  # ResNet-34 backbone: Conv2d/BatchNorm2d/ReLU/MaxPool2d stem plus four stages of BasicBlocks
    (1): Sequential(      # fastai custom head
      (0): AdaptiveConcatPool2d((ap): AdaptiveAvgPool2d(output_size=1) (mp): AdaptiveMaxPool2d(output_size=1))
      (1): Flatten()
      (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Dropout(p=0.25)
      (4): Linear(in_features=1024, out_features=512, bias=True)
      (5): ReLU(inplace)
      (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (7): Dropout(p=0.5)
      (8): Linear(in_features=512, out_features=4, bias=True)
    )),
  opt_func=functools.partial(<class 'torch.optim.adam.Adam'>, betas=(0.9, 0.99)),
  loss_func=FlattenedLoss of CrossEntropyLoss(), metrics=[<function error_rate>],
  true_wd=True, bn_wd=True, wd=0.01, train_bn=True,
  path=PosixPath('data/dogs'), model_dir='models',
  callback_fns=[functools.partial(<class 'fastai.basic_train.Recorder'>, add_time=True, silent=False)], callbacks=[], layer_groups=[...])
(full layer-by-layer model printout abridged)
(20): BatchNorm2d(128, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (21): ReLU(inplace) (22): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (23): BatchNorm2d(128, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (24): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False) (25): BatchNorm2d(128, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (26): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (27): BatchNorm2d(128, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (28): ReLU(inplace) (29): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (30): BatchNorm2d(128, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (31): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (32): BatchNorm2d(128, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (33): ReLU(inplace) (34): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (35): BatchNorm2d(128, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (36): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (37): BatchNorm2d(128, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (38): ReLU(inplace) (39): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (40): BatchNorm2d(128, eps=1e05, momentum=0.1, affine=True, track_running_stats=True)), Sequential( (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (1): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace) (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (4): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (5): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False) (6): BatchNorm2d(256, eps=1e05, 
momentum=0.1, affine=True, track_running_stats=True) (7): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (8): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (9): ReLU(inplace) (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (11): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (13): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (14): ReLU(inplace) (15): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (16): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (18): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (19): ReLU(inplace) (20): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (21): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (22): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (23): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (24): ReLU(inplace) (25): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (26): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (27): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (28): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (29): ReLU(inplace) (30): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (31): BatchNorm2d(256, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (32): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (33): BatchNorm2d(512, eps=1e05, 
momentum=0.1, affine=True, track_running_stats=True) (34): ReLU(inplace) (35): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (36): BatchNorm2d(512, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (37): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (38): BatchNorm2d(512, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (39): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (40): BatchNorm2d(512, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (41): ReLU(inplace) (42): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (43): BatchNorm2d(512, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (44): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (45): BatchNorm2d(512, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (46): ReLU(inplace) (47): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (48): BatchNorm2d(512, eps=1e05, momentum=0.1, affine=True, track_running_stats=True)), Sequential( (0): AdaptiveAvgPool2d(output_size=1) (1): AdaptiveMaxPool2d(output_size=1) (2): Flatten() (3): BatchNorm1d(1024, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (4): Dropout(p=0.25) (5): Linear(in_features=1024, out_features=512, bias=True) (6): ReLU(inplace) (7): BatchNorm1d(512, eps=1e05, momentum=0.1, affine=True, track_running_stats=True) (8): Dropout(p=0.5) (9): Linear(in_features=512, out_features=4, bias=True))], add_time=True, silent=False)
lean.unfreeze()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
lean.recorder.plot()
lean.fit_one_cycle(4, max_lr=slice(1e-4, 3e-3))
epoch  train_loss  valid_loss  error_rate  time 

0  0.676978  0.840263  0.406250  00:06 
1  0.629555  0.816797  0.437500  00:06 
2  0.582286  0.740251  0.335938  00:06 
3  0.535621  0.722407  0.289062  00:06 
lean.save('stage-2')
interp = ClassificationInterpretation.from_learner(lean)
interp.plot_confusion_matrix()
Some of our top losses aren’t due to poor performance by our model: there are images in our dataset that simply shouldn’t be there.
Using the ImageCleaner widget from fastai.widgets, we can prune our top losses, removing photos that don’t belong.
from fastai.widgets import *
First we need to get the file paths of our top losses, which we can do with .from_toplosses. We then feed the top-loss indices and the corresponding dataset to ImageCleaner.
Note that the widget does not delete images directly from disk; instead it creates a new CSV file, cleaned.csv, from which you can create a new ImageDataBunch with the corrected labels to continue training your model.
In order to clean the entire set of images, we need to create a new dataset without the split. The video lecture demonstrated the use of the ds_type param, which no longer has any effect; see the thread for more details.
# Reconstructed from the truncated snippet: a databunch with no validation split (transforms and size assumed)
db = ImageList.from_folder(path).split_none().label_from_folder().transform(get_transforms(), size=224).databunch()
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)
ds, idxs = DatasetFormatter().from_toplosses(learn_cln)
ImageCleaner(ds, idxs, path)
'No images to show :)'
You can also find duplicates in your dataset and delete them! To do this, you need to run .from_similars to get the potential duplicates’ ids and then run ImageCleaner with duplicates=True. The API works in a similar way as with misclassified images: just choose the ones you want to delete and click ‘Next Batch’ until there are no more images left.
Make sure to recreate the databunch and learn_cln from the cleaned.csv file. Otherwise the file will be overwritten from scratch, losing all the results of cleaning the data from top losses.
doc(ImageDataBunch.from_csv)
data_cleaned = ImageDataBunch.from_csv(path, csv_labels="cleaned.csv", ds_tfms=get_transforms(), valid_pct=0.2, size=224, bs=32).normalize()
data_cleaned.show_batch(rows=4)
final_learn = cnn_learner(data_cleaned, models.resnet50, metrics=error_rate)
final_learn.fit_one_cycle(5)
epoch  train_loss  valid_loss  error_rate  time 

0  1.187646  0.361647  0.114943  00:07 
1  0.830483  0.527104  0.149425  00:06 
2  0.620431  0.356172  0.126437  00:06 
3  0.493722  0.324358  0.114943  00:06 
4  0.417320  0.308305  0.103448  00:06 
final_learn.save("stage-1")
# final_learn.unfreeze()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
final_learn.recorder.plot()
final_learn.fit_one_cycle(1, max_lr=slice(1e-3, 3e-3))
epoch  train_loss  valid_loss  error_rate  time 

0  0.218457  0.279203  0.080460  00:06 
final_learn.save("stage-final")
interp = ClassificationInterpretation.from_learner(final_learn)
interp.plot_confusion_matrix()
interp.plot_top_losses(9, figsize=(15,11))
!pip install starlette
Collecting starlette
Building wheels for collected packages: starlette
Successfully built starlette
Installing collected packages: starlette
Successfully installed starlette-0.10.2
final_learn.export()
defaults.device = torch.device('cpu')
img = open_image(path/'akita'/'00000100.jpg')
serving_model = load_learner(path)
pred_class, pred_idx, outputs = serving_model.predict(img)
pred_class
Category akita
from starlette.applications import Starlette
ELMo stands for Embeddings from Language Models; the original paper is at https://arxiv.org/abs/1802.05365
ELMo gained its language understanding from being trained to predict the next word in a sequence of words, a task called Language Modeling. This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.
A step in the pretraining process of ELMo: Given “Let’s stick to” as input, predict the next most likely word – a language modeling task. When trained on a large dataset, the model starts to pick up on language patterns. It’s unlikely it’ll accurately guess the next word in this example. More realistically, after a word such as “hang”, it will assign a higher probability to a word like “out” (to spell “hang out”) than to “camera”.
ELMo actually goes a step further and trains a bidirectional LSTM – so that its language model doesn’t only have a sense of the next word, but also the previous word.
The input to ELMo is a character-level embedding; see the details at https://zhangruochi.com/SubwordModels/2019/12/19/
We can feed our input data to the pre-trained ELMo and get dynamic, contextualized word vectors, which we then use for our specific downstream tasks.
It turns out we don’t need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. We can make do with just the decoder of the Transformer. The decoder is a natural choice for language modeling (predicting the next word) since it’s built to mask future tokens, a valuable feature when it’s generating a translation word by word.
The model stacks twelve decoder layers. Since there is no encoder in this setup, these decoder layers do not have the encoder-decoder attention sublayer that vanilla Transformer decoder layers have. They still have the self-attention layer, however (masked so that the model doesn’t peek at future tokens).
With this structure, we can proceed to train the model on the same language-modeling task: predict the next word using massive (unlabeled) datasets. Just throw the text of 7,000 books at it and have it learn! Books are great for this sort of task since they allow the model to learn to associate related information even when it is separated by a lot of text, something you don’t get when training on, for example, tweets or articles.
$W_e$ is the embedding matrix and $W_p$ is the positional embedding matrix (note that, unlike the classical Transformer’s fixed sinusoidal encoding, it is learned).
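The forward pass can be written out as follows (a reconstruction following the GPT paper's notation; $U$ denotes the one-hot matrix of context tokens and $n$ the number of decoder layers):

```latex
h_0 = U W_e + W_p
h_l = \mathrm{transformer\_block}(h_{l-1}), \quad l = 1, \dots, n
P(u) = \mathrm{softmax}(h_n W_e^{T})
```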
Now that the OpenAI transformer is pretrained and its layers have been tuned to reasonably handle language, we can start using it for downstream tasks. Let’s first look at sentence classification (classify an email message as “spam” or “not spam”):
If our input sequence is $x_1,\cdots,x_m$ and the label is $y$, we can add a softmax layer on top to do classification and use cross-entropy to calculate the loss. In general we would update the parameters to maximize this task objective $L_2$ alone, but we can use multi-task learning to get a more general model: we instead maximize the combined objective $L_3 = L_2 + \lambda L_1$, where $L_1$ is the language-modeling objective from pre-training.
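Written out (a reconstruction of the objectives from the GPT paper; $\mathcal{U}$ is the unlabeled corpus, $\mathcal{C}$ the labeled dataset, $k$ the context window, and $\lambda$ a weighting hyperparameter):

```latex
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```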
The OpenAI paper outlines a number of input transformations to handle the inputs for different types of tasks. The following image from the paper shows the structures of the models and input transformations to carry out different tasks.
The OpenAI transformer gave us a fine-tunable pre-trained model based on the Transformer. But something went missing in this transition from LSTMs to Transformers: ELMo’s language model was bidirectional, but the OpenAI transformer only trains a forward language model. Could we build a Transformer-based model whose language model looks both forward and backward (in the technical jargon, “is conditioned on both left and right context”)?
The input representation of BERT is shown in the figure below for the two example sentences “my dog is cute” and “he likes playing” (why two sentences are needed is explained later). First, a special token [CLS] is added at the beginning of the first sentence, and a [SEP] token is added after “cute” to mark the end of the first sentence; another [SEP] is added after “##ing” to mark the end of the second. Note that the tokenizer splits “playing” into the two tokens “play” and “##ing”. This method of dividing words into finer-grained word pieces was introduced in the earlier machine-translation section and is a common way of handling out-of-vocabulary words. Three embeddings are then computed for each token:
The word embedding is familiar to everyone, and the position embedding is similar: it maps a position (such as 2) to a low-dimensional dense vector. There are only two segment embeddings: one for tokens belonging to the first sentence (segment) and one for tokens belonging to the second. The segment embedding is shared within a sentence so the model can learn which segment each token belongs to. For tasks such as sentiment classification there is only one sentence, so the segment id is always 0; for the entailment task the input is two sentences, so the segment id is 0 or 1.
The BERT model requires a fixed sequence length, such as 128. Shorter sequences are padded at the end, and longer ones are truncated, so that the input is always a fixed-length token sequence. The first token is always the special [CLS]. Since [CLS] carries no semantics of its own, it can (and must) encode the semantics of the entire sentence, i.e. of all the other tokens.
Finding the right task to train a Transformer stack of encoders is a complex hurdle that BERT resolves by adopting a masked language model concept from earlier literature (where it’s called a Cloze task).
Beyond masking 15% of the input, BERT also mixes things a bit in order to improve how the model later finetunes. Sometimes it randomly replaces a word with another word and asks the model to predict the correct word in that position.
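The 80/10/10 replacement scheme can be sketched in plain Python (a hypothetical mask_tokens helper over integer token ids; mask_id and vocab_size are illustrative assumptions, and -100 is just a conventional ignore index):

```python
import random

def mask_tokens(tokens, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """BERT-style masking sketch: pick ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels.append(tok)                         # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id                    # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token
        else:
            labels.append(-100)                        # position is ignored by the loss
    return inputs, labels
```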
If you look back up at the input transformations the OpenAI transformer does to handle different tasks, you’ll notice that some tasks require the model to say something intelligent about two sentences (e.g. are they simply paraphrased versions of each other? Given a wikipedia entry as input, and a question regarding that entry as another input, can we answer that question?).
To make BERT better at handling relationships between multiple sentences, the pretraining process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A, or not?
The BERT paper shows a number of ways to use BERT for different tasks.
For classification tasks, we take the output vector at [CLS], apply a softmax layer for classification, and fine-tune on the labeled data. The fine-tuning approach isn’t the only way to use BERT, though. Just like ELMo, you can use pre-trained BERT to create contextualized word embeddings and then feed these embeddings to your existing model; the paper shows this process yields results not far behind fine-tuning BERT on a task such as named-entity recognition.
Which vector works best as a contextualized embedding? I would think it depends on the task. The paper examines six choices (compared to the fine-tuned model, which achieved a score of 96.4):
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
class Embeddings(nn.Module):
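The Embeddings class is truncated above; a minimal sketch consistent with the description (a lookup table whose output is scaled by $\sqrt{d_{\text{model}}}$) would be:

```python
import math
import torch
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)  # lookup table
        self.d_model = d_model

    def forward(self, x):
        # scale the embeddings by sqrt(d_model), as described above
        return self.lut(x) * math.sqrt(self.d_model)
```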
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed (cite).
In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$$

$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$$
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop}=0.1$.
class PositionalEncoding(nn.Module):
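The PositionalEncoding class is truncated above; a minimal sketch that precomputes the sinusoidal table and adds it to the input (following the Annotated Transformer) would be:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        # precompute the sinusoidal table once, in log space for stability
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # add the encoding for the first x.size(1) positions, then dropout
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
```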
We employ a residual connection (cite) around each of the two sublayers, followed by layer normalization (cite).
class LayerNorm(nn.Module):
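The LayerNorm class is truncated above; a minimal sketch with learnable gain and bias (the a_2 / b_2 parameters of the Annotated Transformer) would be:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(features))   # learnable gain
        self.b_2 = nn.Parameter(torch.zeros(features))  # learnable bias
        self.eps = eps

    def forward(self, x):
        # normalize over the last (feature) dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
```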
That is, the output of each sublayer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sublayer itself. We apply dropout (cite) to the output of each sublayer, before it is added to the sublayer input and normalized.
To facilitate these residual connections, all sublayers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}}=512$.
class SublayerConnection(nn.Module):
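The SublayerConnection class is truncated above; a minimal sketch of the residual-plus-normalization wrapper (using nn.LayerNorm in place of the custom class, a simplification) would be:

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # norm-first variant used by the Annotated Transformer:
        # x + Dropout(Sublayer(LayerNorm(x)))
        return x + self.dropout(sublayer(self.norm(x)))
```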
Each layer has two sublayers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
We call our particular attention Scaled Dot-Product Attention. The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The two most commonly used attention functions are additive attention (cite), and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of $d_k$ (cite). We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients (to illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance $d_k$). To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
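Scaled dot-product attention can be sketched in a few lines of PyTorch (mask handling follows the Annotated Transformer's convention of filling disallowed positions with a large negative number):

```python
import math
import torch

def attention(query, key, value, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V; returns the output and the weights."""
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # block masked positions
    p_attn = scores.softmax(dim=-1)
    return torch.matmul(p_attn, value), p_attn
```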
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$

where the projections are parameter matrices $W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. In this work we employ $h=8$ parallel attention layers, or heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
class MultiHeadedAttention(nn.Module):
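The MultiHeadedAttention class is truncated above; a minimal self-contained sketch in the spirit of the Annotated Transformer (not the exact original code) would be:

```python
import math
import torch
import torch.nn as nn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        # three input projections (Q, K, V) plus the output projection W^O
        self.linears = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(4))
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        nbatches = query.size(0)
        # 1) linear projections: d_model => h heads of size d_k
        query, key, value = [
            l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for l, x in zip(self.linears, (query, key, value))
        ]
        # 2) scaled dot-product attention on all heads at once
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        p_attn = self.dropout(scores.softmax(dim=-1))
        x = torch.matmul(p_attn, value)
        # 3) concatenate heads and apply the final linear W^O
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
```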
The Transformer uses multi-head attention in three different ways:
In encoder-decoder attention layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as (cite).
The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
In addition to attention sublayers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{\text{model}}=512$, and the inner layer has dimensionality $d_{ff}=2048$.
class PositionwiseFeedForward(nn.Module):
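The PositionwiseFeedForward class is truncated above; a minimal sketch of the two linear layers with a ReLU (and dropout) in between would be:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)  # expand to the inner dimension
        self.w_2 = nn.Linear(d_ff, d_model)  # project back to d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # FFN(x) = W_2 ReLU(W_1 x + b_1) + b_2, applied at every position
        return self.w_2(self.dropout(torch.relu(self.w_1(x))))
```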
Most competitive neural sequence transduction models have an encoder-decoder structure (cite). Here, the encoder maps an input sequence of symbol representations $(x_1, …, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, …, z_n)$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1,…,y_m)$ of symbols one element at a time. At each step the model is auto-regressive (cite), consuming the previously generated symbols as additional input when generating the next.
class EncoderDecoder(nn.Module):
class Generator(nn.Module):
The encoder is composed of a stack of $N=6$ identical layers.
def clones(module, N):
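The clones helper is truncated above; a minimal sketch that produces N independent deep copies of a module would be:

```python
import copy
import torch.nn as nn

def clones(module, N):
    """Produce N identical but independent copies of a module."""
    return nn.ModuleList(copy.deepcopy(module) for _ in range(N))
```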
class Encoder(nn.Module):
class EncoderLayer(nn.Module):
The decoder is also composed of a stack of $N=6$ identical layers.
class Decoder(nn.Module):
In addition to the two sublayers in each encoder layer, the decoder inserts a third sublayer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sublayers, followed by layer normalization.
class DecoderLayer(nn.Module):
We also modify the self-attention sublayer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.
def subsequent_mask(size):
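The subsequent_mask function is truncated above; a minimal sketch of the causal mask (position $i$ may attend only to positions $\le i$) would be:

```python
import torch

def subsequent_mask(size):
    """Boolean mask: True where attention is allowed (lower triangle)."""
    attn_shape = (1, size, size)
    mask = torch.triu(torch.ones(attn_shape, dtype=torch.uint8), diagonal=1)
    return mask == 0
```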
Here we define a function from hyperparameters to a full model.
def make_model(src_vocab, tgt_vocab, N=6,
BPE (Byte Pair Encoding) is a word segmentation algorithm:
For example, all the words in our documents database and their frequency are
{‘l o w’: 5, ‘l o w e r’: 2, ‘n e w e s t’: 6, ‘w i d e s t’: 3}
We can initialize our vocabulary library as:
{ ‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘t’, ‘i’, ‘d’ }
The most frequent symbol pair is (‘e’, ‘s’), occurring 9 times, so we add ‘es’ to our vocabulary library.
Our documents database now is:
{‘l o w’: 5, ‘l o w e r’: 2, ‘n e w es t’: 6, ‘w i d es t’: 3}.
Our vocabulary library now is:
{ ‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘t’, ‘i’, ‘d’, ‘es’ }
Again, the most frequent symbol pair is now (‘es’, ‘t’), occurring 9 times, so we add ‘est’ to our vocabulary library.
Our documents database now is:
{‘l o w’: 5, ‘l o w e r’: 2, ‘n e w est’: 6, ‘w i d est’: 3}
Our vocabulary library now is:
{ ‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘t’, ‘i’, ‘d’, ‘es’, ‘est’ }
The rest can be done in the same manner. We can set a threshold on the total size of our vocabulary library; by merging until we reach it, BPE constructs a vocabulary that can represent all the words in terms of subword units.
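The merge loop above can be sketched directly in Python (following the reference implementation in the BPE paper; two merges reproduce the ‘es’ and ‘est’ steps of the worked example):

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair into one symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
for _ in range(2):                    # two merges: ('e','s') -> 'es', then ('es','t') -> 'est'
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)  # most frequent pair
    vocab = merge_vocab(best, vocab)
# vocab is now {'l o w': 5, 'l o w e r': 2, 'n e w est': 6, 'w i d est': 3}
```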
Google NMT (GNMT) uses a variant of this:
Padding and embedding lookup. Using a special
1  def pad_sents_char(sents, char_pad_token): 
For each of these characters $c_i$, we look up a dense character embedding of dimension $e_{char}$. This yields a tensor $x_{emb}$:
We'll reshape $x_{emb}$ to obtain $x_{reshaped} \in \mathbb{R}^{e_{char} \times m_{word}}$ before feeding it into the convolutional network.
Convolutional network. To combine these character embeddings, we'll use 1-dimensional convolutions. The convolutional layer has two hyperparameters: the kernel size $k$ (also called window size), which dictates the size of the window used to compute features, and the number of filters $f$ (also called number of output features or number of output channels). The convolutional layer has a weight matrix $W \in \mathbb{R}^{f \times e_{char} \times k}$ and a bias vector $b \in \mathbb{R}^{f}$. Overall this produces output $x_{conv}$.
For our application, we’ll set $f$ to be equal to $e_{word}$, the size of the final word embedding for word x. Therefore,
Finally, we apply the ReLU function to $x_{conv}$, then use max-pooling to reduce this to a single vector $x_{conv\_out} \in \mathbb{R}^{e_{word}}$, which is the final output of the convolutional network:
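A numpy sketch of the convolution, ReLU, and max-pooling steps described above (all sizes and weights here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
e_char, m_word, k, e_word = 4, 8, 5, 6               # hypothetical sizes; f = e_word filters
x_reshaped = rng.standard_normal((e_char, m_word))   # one word's character embeddings
W = rng.standard_normal((e_word, e_char, k))         # conv weights
b = rng.standard_normal(e_word)                      # conv bias

# 1-D convolution: dot each filter with every window of width k
x_conv = np.stack(
    [W.reshape(e_word, -1) @ x_reshaped[:, i:i + k].reshape(-1) + b
     for i in range(m_word - k + 1)], axis=1)        # shape (e_word, m_word - k + 1)

# ReLU, then max-pool over window positions -> final word embedding
x_conv_out = np.maximum(x_conv, 0).max(axis=1)       # shape (e_word,)
```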
1  #!/usr/bin/env python3 
Highway layer and dropout. Highway Networks have a skip-connection controlled by a dynamic gate. Given the input $x_{conv\_out} \in \mathbb{R}^{e_{word}}$, we compute:
where $W_{proj}, W_{gate} \in \mathbb{R}^{e_{word} \times e_{word}}$, and $\circ$ denotes element-wise multiplication.
1  #!/usr/bin/env python3 
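A minimal numpy sketch of the highway computation above (weights random, sizes hypothetical; dropout omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
e_word = 6                                   # hypothetical embedding size
x_conv_out = rng.standard_normal(e_word)
W_proj = rng.standard_normal((e_word, e_word)); b_proj = np.zeros(e_word)
W_gate = rng.standard_normal((e_word, e_word)); b_gate = np.zeros(e_word)

x_proj = np.maximum(W_proj @ x_conv_out + b_proj, 0)        # ReLU projection
x_gate = sigmoid(W_gate @ x_conv_out + b_gate)              # dynamic gate in (0, 1)
x_highway = x_gate * x_proj + (1.0 - x_gate) * x_conv_out   # gated skip-connection
```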
Combining the above steps gives our character-based word embedding model.
1  #!/usr/bin/env python3 
We will now add an LSTM-based character-level decoder to our NMT system. The main idea is that when our word-level decoder produces an <UNK>
token, we run our character-level decoder (which you can think of as a character-level conditional language model) to instead generate the target word one character at a time, as shown in the figure below. This will help us to produce rare and out-of-vocabulary target words.
We now describe the model in three sections:
Forward computation of the Character Decoder: Given a sequence of integers $x_1,\cdots,x_n \in \mathbb{Z}$ representing a sequence of characters, we look up their character embeddings $\mathbf{x}_1,\cdots,\mathbf{x}_n \in \mathbb{R}^{e_{char}}$ and pass these as input to the (unidirectional) LSTM, obtaining hidden states $h_1, \cdots, h_n$ and cell states $c_1, \cdots, c_n$,
where $h$ is the hidden size of the CharDecoderLSTM. The initial hidden and cell states $h_0$ and $c_0$ are both set to the combined output vector (the attended output) for the current timestep of the main word-level NMT decoder.
For every timestep $t \in { 1, \cdots, n }$ we compute scores (also called logits) $s_t \in \mathbb{R}^{V_{char}}$
where the weight matrix $W_{dec} \in \mathbb{R}^{V_{char} \times h}$ and the bias vector $b_{dec} \in \mathbb{R}^{V_{char}}$. If we passed $s_t$ through a softmax function, we would have the probability distribution for the next character in the sequence.
Training of the Character Decoder: When we train the NMT system, we train the character decoder on every word in the target sentence (not just the words represented by <UNK>).
We pass the input sequence $x_1, \cdots, x_n$, along with the initial states $h_0$ and $c_0$ (obtained from the combined output vector), into the CharDecoderLSTM, thus obtaining scores $s_1,\cdots, s_n$ which we will compare to the target sequence $x_2,\cdots, x_{n+1}$. We optimize with respect to the sum of the cross-entropy loss:
Decoding from the Character Decoder: At test time, we first produce a translation from our word-based NMT system in the usual way (e.g. with a decoding algorithm like beam search). If the translation contains any
1  #!/usr/bin/env python3 
Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
Bidirectional RNNs fix this problem by traversing a sequence in both directions and concatenating the resulting outputs (both cell outputs and final hidden states). For every RNN cell, we simply add another cell but feed inputs to it in the opposite direction; the output $o_t$ corresponding to the $t$-th word is the concatenated vector $\left[ o_t^{(f)}, o_t^{(b)} \right]$, where $o_t^{(f)}$ is the output of the forward-direction RNN on word $t$ and $o_t^{(b)}$ is the corresponding output from the reverse-direction RNN. Similarly, the final hidden state is $h = \left[ h^{(f)}, h^{(b)} \right]$.
Sequence-to-sequence, or “Seq2Seq”, is a relatively new paradigm, with its first published use in 2014 for English-French translation. At a high level, a sequence-to-sequence model is an end-to-end model made up of two recurrent neural networks:
Sutskever et al. 2014, “Sequence to Sequence Learning with Neural Networks”
Encoder RNN produces an encoding of the source sentence.
The encoder network’s job is to read the input sequence to our Seq2Seq model and generate a fixed-dimensional context vector C for the sequence. To do so, the encoder will use a recurrent neural network cell – usually an LSTM – to read the input tokens one at a time. The final hidden state of the cell will then become C. However, because it’s so difficult to compress an arbitrary-length sequence into a single fixed-size vector (especially for difficult tasks like translation), the encoder will usually consist of stacked LSTMs: a series of LSTM “layers” where each layer’s outputs are the input sequence to the next layer. The final layer’s LSTM hidden state will be used as C.
Seq2Seq encoders will often do something strange: they will process the input sequence in reverse. This is done on purpose. The idea is that, by doing this, the last thing that the encoder sees will (roughly) correspond to the first thing that the model outputs; this makes it easier for the decoder to “get started” on the output, which gives the decoder an easier time generating a proper output sentence. In the context of translation, we’re allowing the network to translate the first few words of the input as soon as it sees them; once it has the first few words translated correctly, it’s much easier to go on to construct a correct sentence than it is to do so from scratch.
Decoder RNN is a Language Model that generates target sentence, conditioned on encoding.
The decoder is also an LSTM network, but its usage is a little more complex than the encoder network. Essentially, we’d like to use it as a language model that’s “aware” of the words that it’s generated so far and of the input. To that end, we’ll keep the “stacked” LSTM architecture from the encoder, but we’ll initialize the hidden state of our first layer with the context vector from above; the decoder will literally use the context of the input to generate an output.
Once the decoder is set up with its context, we’ll pass in a special token to signify the start of output generation; in literature, this is usually an
Once we have the output sequence, we use the same learning strategy as usual. We define a loss, the cross entropy on the prediction sequence, and we minimize it with a gradient descent algorithm and backpropagation. Both the encoder and decoder are trained at the same time, so that they both learn the same context vector representation.
At each time step, we pick the most probable token. In other words
This technique is efficient and natural; however, it explores only a small part of the search space, and if we make a mistake at one time step, the rest of the sentence could be heavily impacted.
The idea is to maintain K candidates at each time step,
and compute $H_{t+1}$ by expanding $H_t$ and keeping the best K candidates. In other words, we pick the best K sequences in the following set
where
As we increase K, we gain precision and we are asymptotically exact. However, the improvement is not monotonic and we can set a K that combines reasonable performance and computational efficiency.
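A toy sketch of beam search with K candidates; the next-token distribution here is a hypothetical stand-in for a real decoder:

```python
import math

def beam_search(step_scores, K, length):
    """step_scores(prefix) -> {token: log_prob} for the next position."""
    beams = [((), 0.0)]                       # (sequence, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_scores(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        # expand H_t, keep the best K candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return beams

# hypothetical next-token distribution (independent of the prefix, for brevity)
def toy_scores(prefix):
    return {"a": math.log(0.6), "b": math.log(0.4)}

best = beam_search(toy_scores, K=2, length=3)
```

With K = 1 this reduces to greedy decoding; larger K explores more of the search space at higher cost.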
In Machine Translation, our goal is to convert a sentence from the source language (e.g. Spanish) to the target language (e.g. English). In this assignment, we will implement a sequencetosequence (Seq2Seq) network with attention, to build a Neural Machine Translation (NMT) system. In this section, we describe the training procedure for the proposed NMT system, which uses a Bidirectional LSTM Encoder and a Unidirectional LSTM Decoder.
1  def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2): 
Given a sentence in the source language, we look up the word embeddings from an embeddings matrix, yielding $x_1,\cdots,x_m$ ($x_i \in \mathbb{R}^{e \times 1}$), where $m$ is the length of the source sentence and $e$ is the embedding size. We feed these embeddings to the bidirectional Encoder, yielding hidden states and cell states for both the forward and backward LSTMs. The forward and backward versions are concatenated
to give hidden states $h_i^{enc}$ and cell states $c_i^{enc}$
We then initialize the Decoder’s first hidden state $h_0^{dec}$ and cell state $c_0^{dec}$ with a linear projection of the Encoder’s final hidden state and final cell state
1  def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]: 
With the Decoder initialized, we must now feed it a matching sentence in the target language. On the $t^{th}$ step, we look up the embedding for the $t^{th}$ word, $y_t \in \mathbb{R}^{e \times 1}$. We then concatenate $y_t$ with the combined-output vector $o_{t-1} \in \mathbb{R}^{h \times 1}$ from the previous step to produce $\bar{y}_t \in \mathbb{R}^{(e+h) \times 1}$. Note that for the first target word $o_0$ is the zero vector. We then feed $\bar{y}_t$ as input to the Decoder LSTM.
We then use $h_t^{dec}$ to compute multiplicative attention over $h_1^{enc}, \cdots, h_m^{enc}$
We now concatenate the attention output $a_t$ with the decoder hidden state $h_t^{dec}$ and pass this through a linear layer, tanh, and dropout to obtain the combined-output vector $o_t$
Then, we produce a probability distribution $P_t$ over target words at the $t^{th}$ timestep:
Here, $V_t$ is the size of the target vocabulary. Finally, to train the network we compute the softmax cross-entropy loss between $P_t$ and $g_t$, where $g_t$ is the one-hot vector of the target word at timestep $t$:
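The multiplicative attention and combined-output steps can be sketched in numpy as follows (all sizes and weight matrices are hypothetical; dropout is omitted):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
h, m = 4, 5                               # hypothetical decoder hidden size, source length
h_dec = rng.standard_normal(h)            # decoder hidden state at step t
H_enc = rng.standard_normal((m, 2 * h))   # bidirectional encoder hidden states (rows)
W_att = rng.standard_normal((h, 2 * h))   # multiplicative-attention projection

e_t = H_enc @ (W_att.T @ h_dec)           # scores e_{t,i} = (h_t^dec)^T W_att h_i^enc
alpha_t = softmax(e_t)                    # attention distribution over source positions
a_t = alpha_t @ H_enc                     # attention output: weighted sum of encoder states

# combined output: concatenate [a_t; h_t^dec], project, tanh
W_u = rng.standard_normal((h, 3 * h))
o_t = np.tanh(W_u @ np.concatenate([a_t, h_dec]))
```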
1  def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor, 
1  def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor: 
1  #!/usr/bin/env python3 
1  def pad_sents(sents, pad_token): 
RNNs have been found to perform better with the use of more complex units for activation. Here, we discuss the use of a gated activation function, thereby modifying the RNN architecture. What motivates this? Although RNNs can theoretically capture long-term dependencies, they are very hard to actually train to do so. Gated recurrent units are designed to have more persistent memory, thereby making it easier for RNNs to capture long-term dependencies. Let us see mathematically how a GRU uses $h_{t-1}$ and $x_t$ to generate the next hidden state $h_t$. We will then dive into the intuition of this architecture.
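For reference, the standard GRU equations are (with $\circ$ denoting element-wise multiplication):

$$
\begin{aligned}
z_t &= \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right) && \text{(update gate)} \\
r_t &= \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(r_t \circ U h_{t-1} + W x_t\right) && \text{(new memory)} \\
h_t &= (1 - z_t) \circ \tilde{h}_t + z_t \circ h_{t-1} && \text{(hidden state)}
\end{aligned}
$$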
The above equations can be thought of as a GRU's four fundamental operational stages, and they have intuitive interpretations that make this model much more intellectually satisfying.
Long Short-Term Memory (LSTM) units are another type of complex activation unit that differ a little from GRUs. The motivation for using these is similar to that for GRUs; however, the architecture of such units does differ. Let us first take a look at the mathematical formulation of LSTM units before diving into the intuition behind this design:
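For reference, the standard LSTM equations are:

$$
\begin{aligned}
i_t &= \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1}\right) && \text{(input gate)} \\
f_t &= \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1}\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1}\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W^{(c)} x_t + U^{(c)} h_{t-1}\right) && \text{(new memory cell)} \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t && \text{(final memory cell)} \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$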
This course provides an introduction to human factors research and applications with emphasis on mature areas such as sensation and perception and manual control. Each class will introduce some concrete human factors problem and explore theory and application relevant to solving it. The term-long conceptual design assignment is intended to help maintain focus on applications to design.
Learning Objectives:
A method of describing human-machine interaction, used to study how to make the system better match human capabilities.
Whether each function is to be performed by the human or by the system.
Learning Objectives:
Error of Habituation
Alarm design must be built on a thorough understanding of human auditory processing.
Patterns of rarefaction/compression of air
The presence of a tone that inhibits the perception of another tone occurring before, at the same time as, or after it.
Learning Objectives:
Overview of vision
c = (L - D) / (L + D)
Design for monochrome first, then provide color as redundant coding information.
When lighting conditions are poor, contrast is reduced at all spatial frequencies.
Adaptation is a major characteristic of sensation
Learning Objectives:
Hick’s Law holds that choice reaction time is proportional to log2 of the number of alternatives (reaction time is a function of log2 N).
RT = a + b·log2 N
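A quick numeric illustration of Hick's Law (the coefficients a and b here are hypothetical):

```python
import math

def hick_rt(a, b, n):
    """Choice reaction time for n equally likely alternatives."""
    return a + b * math.log2(n)

# hypothetical coefficients: a = 0.2 s base time, b = 0.15 s per bit
rt_2 = hick_rt(0.2, 0.15, 2)   # 2 alternatives -> 1 bit of uncertainty
rt_8 = hick_rt(0.2, 0.15, 8)   # 8 alternatives -> 3 bits of uncertainty
```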
Controls are used by the human operator to communicate with the machine/device in the system. It's important that controls serve their function. Based on:
Display: anything that conveys information
Learning Objectives:
Cues to depth & distance
Use of emergent features and perceptual salience to integrate displays
– Position (X,Y, Z)
– Orientation (Yaw, Pitch, Roll)
Show each variable as a single output
Learning Objectives:
Compensatory tracking vs. pursuit tracking
MT (movement time) = a + b·log2(2A/W)
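A quick numeric illustration of Fitts' law (the coefficients are hypothetical):

```python
import math

def fitts_mt(a, b, A, W):
    """Movement time for a target of width W at distance (amplitude) A."""
    return a + b * math.log2(2 * A / W)   # log2(2A/W) = index of difficulty in bits

# hypothetical coefficients: each extra bit of difficulty adds b seconds
mt_near = fitts_mt(0.1, 0.1, A=8, W=2)    # ID = log2(8)  = 3 bits
mt_far  = fitts_mt(0.1, 0.1, A=16, W=2)   # ID = log2(16) = 4 bits
```

Doubling the distance (or halving the target width) adds exactly one bit of difficulty.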
Pointing as expressed by Fitts' law is a very special case of tracking.
In pointing:
Learning Objectives:
Selective attention to channels is mainly influenced by the following factors:
For the mathematics to work out, the probabilities of failure for redundant components or subsystems must be completely independent.
User performance models for design.
Users form the goals and subgoals they want to achieve through methods and selections. A method is a sequence of perceptual, cognitive, or motor operation steps.
The seven-stage theory of user-oriented interface design.
Improvement in performance is logarithmic in the number N of trials.
Bridging the gulfs of execution and evaluation depends on mental models; a good mental model can help prevent errors and improve performance.
Make the parts the user cannot see visible, e.g. “rooms”.
During noise, speakers have an automatic normalization response that causes systematic speech modifications, including increased volume, reduced speaking rate, and changes in articulation and pitch.
Issues in Ubicomp
Learning Objectives
Loss hurts more than gain helps (coin flip: heads wins 20, tails loses 10; most people choose not to play).
Undervalue base rates!!
Error: treating independent events as though they were dependent
People tend to base decisions on availability & imaginability.
Confirmation bias: in short, being unable to take in new viewpoints; no matter how something is argued, one keeps believing that one's original view is correct. At root this is overconfidence.
For example: give people who already hold a view on some issue information supporting both sides, and they typically notice only the arguments that support their own view while silently rebutting the arguments that contradict it.
When judging an event, people focus excessively on one particular feature of the event while ignoring the base rate of the larger environment and the sample size.
For example, you see a company double its profits three years in a row and immediately judge its stock a buy. The error is representativeness bias: three consecutive years of doubled profits is a representative feature of a good company, but it does not mean this company really is a good company; a lot of other information has been ignored. For instance, the results may have been deliberately engineered, or the company's future profit opportunities may vanish and the performance may not be sustainable.
When judging a person or thing, people are easily influenced by first impressions and thus form preconceptions.
Parse trees in NLP, analogous to those in compilers, are used to analyze the syntactic structure of sentences. There are two main types of structures used:
Dependency structure of sentences shows which words depend on (modify or are arguments of) which other words. These binary asymmetric relations between the words are called dependencies and are depicted as arrows going from the head (or governor, superior, regent) to the dependent (or modifier, inferior, subordinate). Usually these dependencies form a tree structure. They are often typed with the name of grammatical relations (subject, prepositional object, apposition, etc.). An example of such a dependency tree is shown below.
Usually some constraints:
Given a parsing model M and a sentence S, derive the optimal dependency graph D for S according to M.
A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between head words, and words which modify those heads. Your implementation will be a transitionbased parser, which incrementally builds up a parse one step at a time. At every step it maintains a partial parse, which is represented as follows:
Initially, the stack only contains ROOT, the dependencies list is empty, and the buffer contains all words of the sentence in order. At each step, the parser applies a transition to the partial parse until its buffer is empty and the stack size is 1. The following transitions can be applied:
On each step, your parser will decide among the three transitions using a neural network classifier. Go through the sequence of transitions needed for parsing the sentence “I parsed this sentence correctly”. The dependency tree for the sentence is shown below. At each step, give the configuration of the stack and buffer, as well as what transition was applied this step and what new dependency was added (if any). The first three steps are provided below as an example.
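A minimal sketch of this transition system, stepped through the example sentence with a hand-written gold transition sequence (transitions abbreviated S = SHIFT, LA = LEFT-ARC, RA = RIGHT-ARC):

```python
class PartialParse:
    """Minimal shift / left-arc / right-arc transition system."""
    def __init__(self, sentence):
        self.stack = ["ROOT"]
        self.buffer = list(sentence)
        self.dependencies = []          # list of (head, dependent) pairs

    def apply(self, transition):
        if transition == "S":           # SHIFT: first buffer word -> stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":        # LEFT-ARC: second-top depends on top
            dep = self.stack.pop(-2)
            self.dependencies.append((self.stack[-1], dep))
        elif transition == "RA":        # RIGHT-ARC: top depends on second-top
            dep = self.stack.pop()
            self.dependencies.append((self.stack[-1], dep))

pp = PartialParse(["I", "parsed", "this", "sentence", "correctly"])
for t in ["S", "S", "LA", "S", "S", "LA", "RA", "S", "RA", "RA"]:
    pp.apply(t)
# parsing ends with an empty buffer and only ROOT on the stack
```

The resulting dependencies are (parsed, I), (sentence, this), (parsed, sentence), (parsed, correctly), and (ROOT, parsed).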
1  #!/usr/bin/env python3 
We are now going to train a neural network to predict, given the state of the stack, buffer, and
dependencies, which transition should be applied next. First, the model extracts a feature vector
representing the current state. They can be represented as a list of integers $[w_1,w_2,\cdots,w_m]$ where m is the number of features and each $0 \leq w_i < V$ is the index of a token in the vocabulary (V is the vocabulary size). First our network looks up an embedding for each word and concatenates them into a single input vector:
We then compute our prediction as:
where $h$ is referred to as the hidden layer, $l$ is referred to as the logits, and $\hat{y}$ is referred to as the predictions. We will train the model to minimize the cross-entropy loss:
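A numpy sketch of this forward pass (all sizes and weights are hypothetical; the three output classes stand for the SHIFT, LEFT-ARC, and RIGHT-ARC transitions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, m, n_h = 100, 10, 6, 20                # hypothetical vocab, embed dim, features, hidden size
E = 0.1 * rng.standard_normal((V, d))        # embedding matrix
W = rng.standard_normal((m * d, n_h)); b1 = np.zeros(n_h)
U = rng.standard_normal((n_h, 3));     b2 = np.zeros(3)   # 3 transition classes

w = np.array([5, 17, 2, 0, 42, 9])           # feature token indices [w_1, ..., w_m]
x = E[w].reshape(-1)                         # look up and concatenate embeddings
hidden = np.maximum(x @ W + b1, 0)           # ReLU hidden layer h
logits = hidden @ U + b2                     # logits l
y_hat = softmax(logits)                      # predictions

loss = -np.log(y_hat[0])                     # cross-entropy if the gold transition is class 0
```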
1  #!/usr/bin/env python3 
1  #!/usr/bin/env python3 
1  import numpy as np 
1  def sigmoid(x): 
In word2vec, the conditional probability distribution is given by taking vector dotproducts and applying the softmax function:
The Cross Entropy Loss between the true (discrete) probability distribution p and another distribution q is:
So that the naive-softmax loss for word2vec given in the following equation is the same as the cross-entropy loss between $y$ and $\hat{y}$:
For the backpropagation, let's introduce the intermediate variable $p$, which is a vector of the (normalized) probabilities. The loss for one example is:
We now wish to understand how the computed scores inside $f$ should change to decrease the loss $L_i$ that this example contributes to the full objective. In other words, we want to derive the gradient $\frac{\partial L_i}{\partial f_k}$. The loss $L_i$ is computed from $p$ which in turn depends on $f$.
Notice how elegant and simple this expression is. Suppose the probabilities we computed were p = [0.2, 0.3, 0.5], and that the correct class was the middle one (with probability 0.3). According to this derivation the gradient on the scores would be df = [0.2, -0.7, 0.5].
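Checking this with numpy: the gradient is just the probability vector with 1 subtracted at the correct class.

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])   # computed probabilities
y = np.array([0.0, 1.0, 0.0])   # one-hot vector for the correct (middle) class

df = p - y                      # gradient of the cross-entropy loss w.r.t. the scores f
```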
Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that K negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1,w_2,\cdots,w_K$ and their outside vectors as $u_1,\cdots,u_K$. Note that $o \notin \{w_1, \cdots, w_K\}$. For a center word c and an outside word o, the negative sampling loss function is given by:
The sigmoid function and its gradient is as follows:
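A numpy sketch of the negative sampling loss above (vectors random, dimensions hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, K = 5, 3                           # hypothetical embedding dim, number of negatives
v_c = rng.standard_normal(d)          # center word vector
u_o = rng.standard_normal(d)          # true outside word vector
U_neg = rng.standard_normal((K, d))   # outside vectors u_1..u_K of the K negative samples

# J = -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)
loss = -np.log(sigmoid(u_o @ v_c)) - np.log(sigmoid(-U_neg @ v_c)).sum()
```

Each term is a positive cross-entropy contribution, so the whole loss is positive; unlike naive softmax, its cost is O(K) rather than O(|V|).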
1  def negSamplingLossAndGradient( 
Suppose the center word is $c = w_t$ and the context window is $[w_{t-m},\cdots,w_{t-1},w_t,w_{t+1},\cdots,w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:
Here, $J(v_c,w_{t+j},U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$.
1  def skipgram(currentCenterWord, windowSize, outsideWords, word2Ind, 
The gates we introduced above are relatively arbitrary. Any kind of differentiable function can act as a gate, and we can group multiple gates into a single gate, or decompose a function into multiple gates whenever it is convenient. Let's look at another expression that illustrates this point:
We have knowledge of the derivatives:
The figure below shows the visual representation of the computation. The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation, which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit.
It turns out that the derivative of the sigmoid function with respect to its input simplifies if you perform the derivation (after a fun tricky part where we add and subtract a 1 in the numerator):
As we see, the gradient turns out to simplify and becomes surprisingly simple. For example, the sigmoid expression receives the input 1.0 and computes the output 0.73 during the forward pass. The derivation above shows that the local gradient would simply be (1 - 0.73) * 0.73 ≈ 0.2, as the circuit computed before (see the image above), except this way it would be done with a single, simple and efficient expression (and with fewer numerical issues). Therefore, in any real practical application it would be very useful to group these operations into a single gate. Let's see the backprop for this neuron in code:
1  w = [2,-3,-3] # assume some random weights and data 
Let's see this with another example. Suppose that we have a function of the form:
We don’t need to have an explicit function written down that evaluates the gradient. We only have to know how to compute it. Here is how we would structure the forward pass of such expression:
1  x = 3 # example values 
Computing the backprop pass is easy: We’ll go backwards and for every variable along the way in the forward pass (sigy, num, sigx, xpy, xpysqr, den, invden) we will have the same variable, but one that begins with a d
, which will hold the gradient of the output of the circuit with respect to that variable.
1  # backprop f = num * invden 
1  class MultiplyGate: 
We want to encode word tokens each into some vector that represents a point in some sort of “word” space. This is paramount for a number of reasons but the most intuitive reason is that perhaps there actually exists some Ndimensional space (such that N << 13 million) that is sufficient to encode all semantics of our language. Each dimension would encode some meaning that we transfer using speech. For instance, semantic dimensions might indicate tense (past vs. present vs. future), count (singular vs. plural), and gender (masculine vs. feminine).
Represent every word as an $\mathbb{R}^{|V| \times 1}$ vector with all 0s and one 1 at the index of that word in the sorted English vocabulary, where $|V|$ is the size of our vocabulary. Word vectors in this type of encoding would appear as the following:
We represent each word as a completely independent entity. This word representation does not directly give us any notion of similarity. For instance,
For this class of methods to find word embeddings (otherwise known as word vectors), we first loop over a massive dataset and accumulate word cooccurrence counts in some form of a matrix X, and then perform Singular Value Decomposition on X to get a
$USV^{T}$ decomposition. We then use the rows of U as the word embeddings for all words in our dictionary. Let us discuss a few choices of X.
As our first attempt, we make the bold conjecture that words that are related will often appear in the same documents. We use this fact to build a word-document matrix $X$ in the following manner: loop over billions of documents, and each time word $i$ appears in document $j$, add one to entry $X_{ij}$. This is obviously a very large matrix $\mathbb{R}^{|V| \times M}$ and it scales with the number of documents ($M$). So perhaps we can try something better.
A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the context window surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n}, \dots, w_{i-1}$ and $w_{i+1}, \dots, w_{i+n}$. We build a co-occurrence matrix $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window.
Example: CoOccurrence with Fixed Window of n=1:
| * | START | all | that | glitters | is | not | gold | well | ends | END |
|----------|-------|-----|------|----------|----|-----|------|------|------|-----|
| START | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| all | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| that | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| glitters | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| is | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| not | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| gold | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| well | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| ends | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| END | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
Note: In NLP, we often add START and END tokens to represent the beginning and end of sentences, paragraphs or documents. In this case we imagine START and END tokens encapsulating each document, e.g., “START All that glitters is not gold END”, and include these tokens in our co-occurrence counts.
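A minimal co-occurrence counter in this spirit (the toy corpus and the naive whitespace tokenization here are hypothetical):

```python
from collections import defaultdict

def co_occurrence(docs, window=1):
    """Symmetric word-by-word co-occurrence counts with a fixed window."""
    M = defaultdict(int)
    for doc in docs:
        toks = ["START"] + doc.lower().split() + ["END"]
        for i, w in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    M[(w, toks[j])] += 1
    return M

M = co_occurrence(["all that glitters is not gold",
                   "all is well that ends well"])
```

Because the count of ($w_i$, $w_j$) always matches that of ($w_j$, $w_i$), the matrix is symmetric by construction.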
The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run dimensionality reduction. In particular, we will run SVD (Singular Value Decomposition), which is a kind of generalized PCA (Principal Components Analysis), to select the top $k$ principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. We obtain a full matrix decomposition, with the singular values ordered in the diagonal $S$ matrix, and our new, shorter length-$k$ word vectors in $U_k$.
Eigenvalues quantify the importance of information along the line of eigenvectors. Equipped with this information, we know what part of the information can be ignored and how to compress information (SVD, Dimension reduction & PCA). It also helps us to extract features in developing machine learning models. Sometimes, it makes the model easier to train because of the reduction of tangled information. It also serves the purpose to visualize tangled raw data.
For an eigenvalue $\lambda$ and eigenvector $v$ of $A$, we have $Av = \lambda v$,
where the dimension of $A$ is $n \times n$ and $v$ is an $n \times 1$ vector.
Let’s assume a matrix A has two eigenvalues and eigenvectors.
We can concatenate them together and rewrite the equations in the matrix form.
We can generalize it into any number of eigenvectors as
A square matrix A is diagonalizable if we can convert it into a diagonal matrix, like
An n × n square matrix is diagonalizable if it has n linearly independent eigenvectors. If a matrix is symmetric, it is diagonalizable. If a matrix does not have repeated eigenvalues, it always has enough linearly independent eigenvectors to diagonalize it. If it has repeated eigenvalues, there is no guarantee we have enough eigenvectors, and some such matrices will not be diagonalizable.
If $A$ is a square matrix with $N$ linearly independent eigenvectors ($v_1$, $v_2$, $\cdots$, $v_n$) and corresponding eigenvalues ($\lambda_1$, $\lambda_2$, $\cdots$, $\lambda_n$), we can rearrange
into
For example,
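A numpy check of the $A = V \Lambda V^{-1}$ factorization on a hypothetical 2×2 matrix with distinct eigenvalues:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])             # distinct eigenvalues (5 and 2), so diagonalizable

lam, V = np.linalg.eig(A)              # eigenvalues, eigenvectors as columns of V
Lam = np.diag(lam)

A_rebuilt = V @ Lam @ np.linalg.inv(V)  # A = V Lambda V^{-1}
```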
However, the above method is possible only if $A$ is a square matrix and $A$ has n linearly independent eigenvectors. Now, it is time to develop a solution for all matrices using SVD.
The matrix $AA^{T}$ and $A^{T}A$ are very special in linear algebra. Consider any m × n matrix A, we can multiply it with $A^{T}$ to form $AA^{T}$ and $A^{T}A$ separately. These matrices are
We name the eigenvectors for $AA^{T}$ as $u_i$ and $A^{T}A$ as $v_i$ here and call these sets of eigenvectors $u$ and $v$ the singular vectors of A. Both matrices have the same positive eigenvalues. The square roots of these eigenvalues are called singular values. We concatenate vectors $u_i$ into $U$ and $v_i$ into $V$ to form orthogonal matrices.
SVD states that any matrix A can be factorized as:
S is a diagonal matrix with r elements equal to the square roots of the positive eigenvalues of $AA^{T}$ or $A^{T}A$ (both matrices have the same positive eigenvalues anyway).
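This relationship between singular values and eigenvalues can be checked with numpy on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))               # any rectangular matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U S V^T

# singular values are the square roots of the eigenvalues of A^T A
eig_desc = np.linalg.eigvalsh(A.T @ A)[::-1]       # eigvalsh returns ascending order
```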
This reduceddimensionality cooccurrence representation preserves semantic relationships between words, e.g. doctor and hospital will be closer than doctor and dog.
Although these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information, they are associated with many other problems:
1  def compute_co_occurrence_matrix(corpus, window_size=4): 
Instead of computing and storing global information about some huge dataset (which might be billions of sentences), we can try to create a model that will be able to learn one iteration at a time and eventually be able to encode the probability of a word given its context. The idea is to design a model whose parameters are the word vectors. Then, train the model on a certain objective. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors.
Word2vec is a software package that actually includes:
First, we need to create such a model that will assign a probability to a sequence of tokens. Let us start with an example:
“The cat jumped over the puddle.”
A good language model will give this sentence a high probability because this is a completely valid sentence, syntactically and semantically. Mathematically, we can call this probability on any given sequence of n words:
We can take the unary (unigram) language model approach and break apart this probability by assuming the word occurrences are completely independent:
However, we know this is a bit ludicrous because we know the next word is highly contingent upon the previous sequence of words. And the silly sentence example might actually score highly. So perhaps we let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it. We call this the bigram model and represent it as:
Again this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence, but as we will see, this representation gets us pretty far along.
One approach is to create a model such that given the center word “jumped”, the model will be able to predict or generate the surrounding words “The”, “cat”, “over”, “the”, “puddle”. Here we call the word “jumped” the center word. We call this type of model a Skip-Gram model.
We break down the way this model works in these 6 steps:
How do we calculate $P(o \mid c)$? We will use two vectors per word $w$:
Then for a center word c and a context word o:
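A numpy sketch of this softmax probability (vectors random, sizes hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 10, 4                     # hypothetical vocabulary size and embedding dim
U = rng.standard_normal((V, d))  # outside ("context") vectors u_w, one row per word
v_c = rng.standard_normal(d)     # center word vector

# P(o | c) = exp(u_o^T v_c) / sum_w exp(u_w^T v_c)
P = softmax(U @ v_c)
```

Row $o$ of `P` is the model's probability of seeing outside word $o$ given center word $c$.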
1  class SkipGram(nn.Module): 
Another approach is to treat {“The”, “cat”, “over”, “the”, “puddle”} as a context and from these words, be able to predict or generate the center word “jumped”. This type of model we call a Continuous Bag of Words (CBOW) Model.
We breakdown the way this model works in these steps:
1  class CBOW(nn.Module): 
Let's take a second to look at the objective function. Note that the summation over $|V|$ is computationally huge! Any update we do or evaluation of the objective function would take $O(|V|)$ time, which if we recall is in the millions. A simple idea: we could instead just approximate it.
For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples! We “sample” from a noise distribution $P_n(w)$ whose probabilities match the ordering of the frequency of the vocabulary. Unlike the probabilistic model of word2vec, where for each input word the probability is computed over all target words in the vocabulary, here each input word has only a few target words (a few true and the rest randomly selected false targets). The key difference compared to the probabilistic model is the use of a sigmoid activation as the final discriminator, replacing the softmax function of the probabilistic model.
Given this example (we get positive examples by using the same skip-gram technique, a fixed window that goes around the word):
“I want a glass of orange juice to go along with my cereal”
The sampling will look like this:
| Context | Word | Target |
|---------|-------|--------|
| orange | juice | 1 |
| orange | king | 0 |
| orange | book | 0 |
| orange | the | 0 |
| orange | of | 0 |
So the steps to generate the samples are:
from keras.layers import Merge
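The sampling steps above can be sketched in plain Python. The unigram-count-to-the-3/4-power noise distribution is the choice commonly reported for word2vec; the toy corpus and helper name are assumptions:

```python
import random
from collections import Counter

random.seed(0)
corpus = "i want a glass of orange juice to go along with my cereal".split()

# Noise distribution P_n(w): unigram counts raised to the 3/4 power,
# so its probabilities follow the ordering of the word frequencies.
counts = Counter(corpus)
words = list(counts)
weights = [counts[w] ** 0.75 for w in words]

def make_samples(context, positive, k=4):
    """One positive (context, word, 1) pair plus k negative (context, word, 0) pairs."""
    samples = [(context, positive, 1)]
    while len(samples) < k + 1:
        w = random.choices(words, weights=weights)[0]
        if w != positive:  # never label the true context word as negative
            samples.append((context, w, 0))
    return samples

samples = make_samples("orange", "juice", k=4)
```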
Hierarchical softmax is a much more efficient alternative to the normal softmax. In practice, hierarchical softmax tends to work better for infrequent words, while negative sampling works better for frequent words and lower-dimensional vectors.
Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf of the tree is a word, and there is a unique path from root to leaf. In this model, there is no output representation for words. Instead, each node of the graph (except the root and the leaves) is associated with a vector that the model is going to learn.
In this model, the probability of a word $w$ given a vector $w_i$, $p(w \mid w_i)$, is equal to the probability of a random walk starting at the root and ending at the leaf node corresponding to $w$. The main advantage of computing the probability this way is that the cost is only O(log(V)), corresponding to the length of the path.
Taking $w_2$ in the figure above, we must take two left edges and then a right edge to reach $w_2$ from the root, so
Therefore,
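As a numeric sketch of this random walk, assume a tiny balanced tree over four words: one learned vector per internal node, with a sigmoid deciding left versus right at each branch (all vectors and the tree layout here are made up for illustration):

```python
import numpy as np

np.random.seed(0)
d = 8
w_i = np.random.randn(d) * 0.1
# One vector per internal node of a depth-2 binary tree over 4 leaf words.
root, left, right = (np.random.randn(d) * 0.1 for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_leaf(path):
    """Multiply branch probabilities along the root-to-leaf path.
    path is a list of (node_vector, direction), direction +1 = left, -1 = right;
    sigma(-x) = 1 - sigma(x) handles the right branches."""
    p = 1.0
    for node, direction in path:
        p *= sigmoid(direction * node @ w_i)
    return p

paths = {
    "w1": [(root, +1), (left, +1)],
    "w2": [(root, +1), (left, -1)],
    "w3": [(root, -1), (right, +1)],
    "w4": [(root, -1), (right, -1)],
}
probs = {w: p_leaf(p) for w, p in paths.items()}
# The walk probabilities over all leaves sum to 1, so this is a valid distribution.
```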
So far, we have looked at two main classes of methods to find word embeddings.
In comparison, GloVe consists of a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful substructure. It shows state-of-the-art performance on the word analogy task, and outperforms other current methods on several word similarity tasks.
def weight_func(x, x_max, alpha):
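The truncated `weight_func` above is presumably the GloVe weighting function, which is $f(x) = (x/x_{max})^\alpha$ for $x < x_{max}$ and 1 otherwise; a sketch (the default parameter values are the ones reported in the GloVe paper):

```python
def weight_func(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare co-occurrence counts, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# Rare pairs get a small weight; anything at or above x_max is clipped to 1,
# so extremely frequent pairs (e.g. with stop words) cannot dominate the loss.
w_rare = weight_func(1.0)
w_capped = weight_func(500.0)
```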
In conclusion, the GloVe model efficiently leverages global statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, and produces a vector space with meaningful substructure. It consistently outperforms word2vec on the word analogy task, given the same corpus, vocabulary, window size, and training time. It achieves better results faster, and also obtains the best results irrespective of speed.
class Bicycle
At this point `size` responds normally, but adding two collections together is a problem: although `+` is combining `Parts` objects, the object `+` returns is an `Array` instance, and `Array` does not know how to respond to `spares`.
road_bike_parts = Parts.new([chain, road_tire, tape])
# in forwardable.rb
The two main differences from the `def_delegator` method are that `def_delegators` takes a set of methods to forward, and that those methods cannot be aliased.
require 'forwardable'
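A minimal sketch of `def_delegators` forwarding, in the spirit of the `Parts` example above (the class layout here is an assumption, not the book's exact code):

```ruby
require 'forwardable'

class Parts
  extend Forwardable
  # Forward a set of methods to the wrapped array in one call;
  # unlike def_delegator, none of them can be renamed (aliased).
  def_delegators :@parts, :size, :each

  include Enumerable  # each is delegated, so Enumerable works too

  def initialize(parts)
    @parts = parts
  end
end

parts = Parts.new(['chain', 'tire', 'tape'])
parts.size  # delegated to the underlying Array
```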
Knowledge of how an object is created is best placed in a factory. That way, all you need is a single description in order to create the object.
road_config = [['chain','10speed'],