Does anyone know how to solve this question? (RNN) Given h_{t-1} (R^100) and x_t (R^10), state all variables required to express h_t and their dimensionalities. Or: suppose x (R^100) and h (R^10). How many parameters would the RNN have in total, without biases? I'm confused about why x and h can have different dimensions, and I cannot picture what the RNN looks like. Maybe more than one layer for the first question? But what about the second case, where x has more dimensions than h? Grateful for an answer.
Hi, the dimensions of x and h can be any numbers. I suggest you check the RNN chapters to understand the concept thoroughly. W_xh and W_hh are shared through time, so we do not have different W_xh's and W_hh's; we only have one of each. So the answer is the sum of the parameters in W_xh and W_hh: dim(x) * dim(h) + dim(h) * dim(h). For your first question that is 10*100 + 100*100 = 11000, and for the second one 100*10 + 10*10 = 1100. I hope this helps.
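The counting rule above can be written as a tiny sketch (layer sizes taken from this thread; a vanilla RNN h_t = f(W_xh x_t + W_hh h_{t-1}) without biases is assumed):

```python
def rnn_param_count(dim_x, dim_h):
    # W_xh has shape (dim_h, dim_x), W_hh has shape (dim_h, dim_h);
    # both are shared across all time steps, so we count them once.
    return dim_x * dim_h + dim_h * dim_h

print(rnn_param_count(10, 100))   # first question:  10*100 + 100*100 = 11000
print(rnn_param_count(100, 10))   # second question: 100*10 + 10*10  = 1100
```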
I think 3 from the covariance matrix and 4 for the p(j), but is it 3 (the dimension) * 4 for the means or just 4? Would really appreciate an answer :)
Maus, you forgot that the weights of the covariance matrix are shared, so I think you counted too many parameters.
Does anyone know the answer for this question?
I found that lambda is either 0 or equal to x, so we need to choose x, because otherwise p(x|lambda) = 0 and we cannot differentiate.
There was a YouTube video calculating this; the result is the sample mean, just as for the Gaussian distribution.
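As a quick numeric sanity check (a sketch with a made-up sample, not the exam data): for p(x|λ) = λ^x e^{-λ} / x!, the log-likelihood of a sample is Σ(x_i log λ - λ - log x_i!), and setting its derivative Σ(x_i/λ) - n to zero gives λ = mean(x). A brute-force grid scan confirms the maximizer is the sample mean:

```python
import math

# Hypothetical sample chosen for illustration; any nonnegative integers work.
xs = [3, 1, 4, 1, 5, 9, 2, 6]

def log_likelihood(lam, xs):
    # sum of log(lam^x * exp(-lam) / x!) over the sample; lgamma(x+1) = log(x!)
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

# Scan a fine grid of candidate lambdas and pick the best one.
grid = [i / 1000 for i in range(1, 20001)]
best = max(grid, key=lambda lam: log_likelihood(lam, xs))
print(best, sum(xs) / len(xs))  # the maximizer sits at the sample mean (3.875)
```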
Does anybody have an idea for the following question? In 4)c) it was asked: when using a sigmoid activation function and the NN is not working well, does it help to remove it? And generally speaking, does it help to remove an activation function if the performance is poor? I saw the same question for ReLU before.
Removing the activation function shouldn't do you any good, as it removes the nonlinearity (which is what enables learning complex functions in the first place). You can either switch the activation function to tanh (which is centered around 0) or apply techniques like proper weight initialization, batch normalization, etc.
I'd say no, because removing the activation functions removes all the nonlinearities. Use tanh instead.
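To illustrate the point (a sketch with random toy matrices, not the exam network): without activation functions, stacked linear layers collapse into a single linear layer, since the weight matrices simply multiply into one matrix, so no expressiveness is gained by depth.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5,))        # toy input vector
W1 = rng.normal(size=(8, 5))     # first "layer", activation removed
W2 = rng.normal(size=(3, 8))     # second "layer", activation removed

deep = W2 @ (W1 @ x)             # two stacked linear layers
shallow = (W2 @ W1) @ x          # one linear layer with the product matrix

print(np.allclose(deep, shallow))  # True: the deep net equals a linear map
```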
Is there a concise answer to this question? I cannot put my thoughts on paper.
Non-parametric methods do have parameters, but these parameters control the model complexity, not the distribution of the model. E.g. histograms have the bin size as a parameter; changing it does not directly change the modeled distribution.
Does anyone know the time and space complexity of BPTT as a function of time t? This is often asked, but I can't find an answer. Thanks!
I am having trouble calculating the parameters for each layer and wondering if my answers are correct:
i) 2x2x10 = 40 parameters
ii) 2x2x30 = 120 parameters
iii) 0 parameters
iv) 450 parameters
Either way, in the second layer you'd have 10 input channels, because the previous convolution had 10 filters as its output. That means the second convolution would have 10x2x2x30 parameters without the biases. This link provides helpful explanations about the calculations: https://www.learnopencv.com/number-of-parameters-and-tensor-sizes-in-convolutional-neural-network/
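A small sketch of that counting rule (layer sizes from this thread, biases omitted): a conv layer has in_channels * kernel_height * kernel_width * out_channels weights.

```python
def conv_params(in_ch, kh, kw, out_ch):
    # weights only, no biases: one kh x kw kernel per (input, output) channel pair
    return in_ch * kh * kw * out_ch

print(conv_params(1, 2, 2, 10))    # first conv (1-channel input):  40
print(conv_params(10, 2, 2, 30))   # second conv (10-channel input): 1200
```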
It is 2*2. The solution is available here on Studydrive in another file.
Does anybody know the answer to this? Either the number of parameters decreases, or the number remains the same because the parameters are set to 0 after training, once it has been determined that the influence of the nodes is too small.
There is one hyper-parameter per dropout layer (it is chosen, not trained), i.e. "the probability of retaining a unit in the network", denoted 'p', with a range 0.4 ≤ p ≤ 0.8. Link to the dropout paper: http://jmlr.org/papers/v15/srivastava14a.html
Does anyone have an idea of the solutions to 1.10, 6.1 and 6.2? Thanks in advance!
There was one more question (not 100% sure where it was). It was something like: a plot of two error functions was given, and at some point one dropped while the other did not. Then it said that the network with the better error curve was significantly shallower. Why did the shallower model perform better in this case?
My answer was: 1. In deeper models it is more difficult to propagate the gradients. 2. Use LSTMs or GRUs. Got full points. There might be other or better answers though.
Probably also residual nets? Although they may not tackle the "efficient" tag
You forgot: How does the number of trainable parameters change, when using dropout?
My suggestion: "The total number of parameters doesn't change. In each run, the number of parameters that are trained is reduced proportionally to the dropout percentage."
I agree. One could argue that sampling the switch-off scheme follows some probability distribution (the lecture says: switch off units randomly), which might introduce some parameters, but clearly O(additional params) << O(NN params).
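A small sketch (toy layer sizes, plain NumPy) of why the total parameter count is unchanged: dropout only multiplies activations by a freshly sampled 0/1 mask at training time; the weight matrices keep their shape.

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(100, 50))   # weights of a toy fully connected layer
x = rng.normal(size=(50,))

p = 0.8                          # probability of retaining a unit
mask = rng.random(100) < p       # sampled anew on every forward pass
h = mask * (W @ x)               # dropout zeroes activations, not weights

print(W.size)                    # 5000 parameters, with or without dropout
print(int(mask.sum()))           # number of units retained this pass
```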
Hey guys, has anyone here taken "Laboratory: Machine Learning"? The course page says that "there will be a short test to ensure that students are prepared at the beginning of each unit. Passing the test is prerequisite for attending a unit." Does that mean that if I don't have previous theoretical knowledge I will automatically fail the course? And what is the content of those "entry tests"?
I'd also like to know more about the lab!
Hi guys! I'm planning to take both courses ML and CV. Which one should I start with? Or doesn't it matter? Are both exams equally difficult? (I'm not a CS student, but a Mechanical Eng. with some basic programming skills and strong interest in IT. Next semester I will switch to the M.Sc. Automation Eng.)
What did you write for the segmentation map question, where the 32x32x4096 feature map had to be transformed back into a full-res segmentation map?
I answered with softmax and multi-class cross-entropy (I think this is the same as softmax loss). But wouldn't flattening the output and stacking a simple fully connected layer on top be enough for that task? Maybe I got something wrong in the task. But since deconvolution wasn't a topic of any lecture, I do not think it was needed (though that does not mean using it was wrong).
A fully connected layer would work; it would just have a lot of parameters: 32*32*4096 nodes on the left, 128*128 nodes on the right, so ~68.7 * 10^9 connections between them if fully connected. I think these orders of magnitude are almost impossible in practice.
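Checking that back-of-the-envelope count (sizes taken from the thread): a dense layer mapping the flattened 32x32x4096 feature map to a 128x128 segmentation map needs one weight per input-output pair.

```python
# Weight count of a fully connected layer between the two tensors, no biases.
left = 32 * 32 * 4096      # 4,194,304 input nodes (flattened feature map)
right = 128 * 128          # 16,384 output nodes (full-res segmentation map)
connections = left * right
print(connections)         # 68719476736, i.e. ~68.7 * 10^9 weights
```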
How do I get the exercise sheets? I am not in the exercise Moodle learning room and can only access the solution zip.
What do you mean by exercise Moodle room? There is no separate Moodle room for the exercises, only the one for the lecture, where you also find the exercise sheets and the solutions.
Could someone write down the questions of the first exam? Whatever you remember is really appreciated.
In total: 100 pts. Tasks with the highest number of points / that were surprising:
- 4 pts: Bayes, p(X=x, Y=y) = ?, with numbers for P(X=x, Y\neq y), P(Y=y), P(X=x) given
- 3 pts: Python coding exercise for 2-class linear regression (trainLinReg()) (the only programming task)
- 6 pts: given a distribution ((\lambda^x * exp(-\lambda)) / x!), derive the maximum likelihood and the optimal parameter for lambda
- 7 pts: given a computation graph representing a network, give the backpropagation steps (derivations)
- 4 pts: given a CNN with 2 convolution layers (2x2), one max-pooling layer and one more conv layer (3x3), and input size (584x820 or something), give the output dimensions of each layer

More unordered information:
- 2x2 pts: two questions about k-means (e.g. can it capture ellipsoid clusters?)
- 2x2 pts: two questions about SVMs (e.g. if nonlinear: are slack variables the way to go?)
- 2x2 pts: residual networks: sketch + how it works & what is it for in deep networks?
- 2x2 pts: what are the problems when propagating gradients in RNNs? & how to cope with them?

No EM, no AdaBoost, no backprop pseudocode.
Wasn't the 4 pts CNN question about how many weights (without biases) each layer has, rather than about the output dimensions?
Good Luck Guys!
S18: in the lecture, Glorot initialization is defined as 2/(n_in + n_out).
This seems to be a copy-paste error.
I think saying N is smaller might be a bit misleading. The slide only says that, due to sparsity, the effective runtime is between O(N) and O(N^2).
Here the constraint is only a_n >=0. C only gets added in the next step with the slack variables.
Was anyone able to solve questions 6.1 and 6.2 of the first exam from WS18/19?
What did Prof. Leibe say about the exam date during the lecture?
TBD
I'm really unsure, and there's no explicit mention of this on the slides. We have 4 weight matrices in an LSTM module (1 for each of the contained layers). Each matrix is 110x110. Taking 4 of them gives 48400 params. (edit: this doesn't make sense to me anymore)
In my calculations, the LSTM matrices are of size (10 x 110), because they are multiplied with vectors of the shape (110 x 1). These vectors are the concatenation of h_{t-1} and x_t, therefore adding up to 110 dimensions. If we just take these matrices, we have 4400 params.
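That count can be sketched as follows (assuming the concatenated-input formulation and the 100-dim x / 10-dim h sizes from this thread, no biases): each of the four gates applies a (10 x 110) matrix to the concatenation [h_{t-1}; x_t].

```python
def lstm_param_count(dim_x, dim_h):
    # 4 gates (input, forget, output, cell candidate), each with a
    # (dim_h x (dim_h + dim_x)) weight matrix acting on [h_{t-1}; x_t],
    # shared across time, biases omitted.
    return 4 * dim_h * (dim_h + dim_x)

print(lstm_param_count(100, 10))  # 4 * 10 * 110 = 4400
```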
Not entirely sure. My answer would be: 3 * 5*5 * 100 = 7500. The image dimensions shouldn't play a role in the number of parameters. Factor 3 because of the 3 color channels (RGB).
Assuming W_{xh} is a 10x100 matrix and W_{hh} a 10x10 matrix, I'd say the result is 1000+100 = 1100 params.
Do we add the channel results together? If yes, my result is the 2x2 matrix:
6, -6
6, 9
What would be the solution here for p1 and p2? My solution is: p1 has target variable -1 and p2 has target variable +1. Or is it the other way around? I am not sure.
What do you mean by rewriting? For p1 I am getting a value smaller than -1 and for p2 a value larger than 1!
Hi guys, thanks for posting the questions. For some of them I'm not sure about the answer; could someone maybe help me?
- 3.3: In the second part, is the third KKT condition the right answer for the optimization constraint that causes this sparseness?
- 5.2: Is the number of parameters (476×236×100)×(5×5×3)?
- 5.3: Is the size of the output (2×2×1)? What are the given numbers useful for?
- 5.4: Is the output size 1×2 with 6, 1?
- 6.6: Any ideas?
Thanks in advance! :)
Check out the new question sheet. Someone updated it. It is actually 4x4.
Thank you!
I am very certain that this was a 4x4 matrix. And the result was 2x2.
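To illustrate the channel-summation rule (a sketch with made-up numbers, not the actual exam values): a valid convolution of a 2-channel 4x4 input with a matching 2-channel 3x3 filter produces a single 2x2 output map, because the per-channel responses are summed.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(-2, 3, size=(2, 4, 4))   # 2 channels, 4x4 each (toy values)
w = rng.integers(-1, 2, size=(2, 3, 3))   # matching 2-channel 3x3 filter

out = np.zeros((2, 2))                    # valid output: (4 - 3 + 1)^2
for i in range(2):
    for j in range(2):
        # sum over all channels and the 3x3 window (cross-correlation,
        # as convolution is usually implemented in CNNs)
        out[i, j] = np.sum(x[:, i:i+3, j:j+3] * w)

print(out.shape)  # (2, 2): one output map, channels summed together
```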
You lost points if you did not explicitly say that we want to maximize the likelihood and therefore minimize the negative log-likelihood.
Here you did not get full points if you named a non-parametric method that has parameters. You had to explain why it is still called non-parametric.
Another question was: What is sequence learning as opposed to batch learning? And: When do we get vanishing gradients? Exploding gradients? Stable gradients?
Hi, did someone take the exam recently? I know a sample exam was posted, but I wanted to know whether last year's exam was similar. Did it include programming (like filling in missing code)? Thanks.
Did anyone here take the first exam date and could post a memory protocol of the tasks here? :)